Author name cluster

David Krueger

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

18 papers

1 author row

NeurIPS Conference 2025 Conference Paper

Detecting High-Stakes Interactions with Activation Probes

Alex McKenzie
Urja Pawar
Phil Blandfort
William Bankes
David Krueger
Ekdeep S Lubana
Dmitrii Krasheninnikov

Monitoring is an important aspect of safely deploying Large Language Models (LLMs). This paper examines activation probes for detecting ``high-stakes'' interactions---where the text indicates that the interaction might lead to significant harm---as a critical, yet underexplored, target for such monitoring. We evaluate several probe architectures trained on synthetic data, and find them to exhibit robust generalization to diverse, out-of-distribution, real-world data. Probes' performance is comparable to that of prompted or finetuned medium-sized LLM monitors, while offering computational savings of six orders-of-magnitude. These savings are enabled by reusing activations of the model that is being monitored. Our experiments also highlight the potential of building resource-aware hierarchical monitoring systems, where probes serve as an efficient initial filter and flag cases for more expensive downstream analysis. We release our novel synthetic dataset and the codebase at \url{https: //github. com/arrrlex/models-under-pressure}.

PDF Details

NeurIPS Conference 2025 Conference Paper

Distributional Training Data Attribution: What do Influence Functions Sample?

Bruno Mlodozeniec
Isaac Reid
Sam Power
David Krueger
Murat Erdogdu
Richard Turner
Roger Grosse

Randomness is an unavoidable part of training deep learning models, yet something that traditional training data attribution algorithms fail to rigorously account for. They ignore the fact that, due to stochasticity in the initialisation and batching, training on the same dataset can yield different models. In this paper, we address this shortcoming through introducing distributional training data attribution (d-TDA), the goal of which is to predict how the distribution of model outputs (over training runs) depends upon the dataset. Intriguingly, we find that influence functions (IFs), a popular data attribution tool, are 'secretly distributional': they emerge from our framework as the limit to unrolled differentiation, without requiring restrictive convexity assumptions. This provides a new perspective on the effectiveness of IFs in deep learning. We demonstrate the practical utility of d-TDA in experiments, including improving data pruning for vision transformers and identifying influential examples with diffusion models.

PDF Details

NeurIPS Conference 2025 Conference Paper

From Dormant to Deleted: Tamper-Resistant Unlearning Through Weight-Space Regularization

Shoaib Ahmed Siddiqui
Adrian Weller
David Krueger
Gintare Karolina Dziugaite
Michael Mozer
Eleni Triantafillou

Recent unlearning methods for LLMs are vulnerable to relearning attacks: knowledge believed-to-be-unlearned re-emerges by fine-tuning on a small set of (even seemingly-unrelated) examples. We study this phenomenon in a controlled setting for example-level unlearning in vision classifiers. We make the surprising discovery that forget-set accuracy can recover from around 50\% post-unlearning to nearly 100\% with fine-tuning on just the *retain* set---i. e. , zero examples of the forget set. We observe this effect across a wide variety of unlearning methods, whereas for a model retrained from scratch excluding the forget set (gold standard), the accuracy remains at 50\%. We observe that resistance to relearning attacks can be predicted by weight-space properties, specifically, $L_2$-distance and linear mode connectivity between the original and the unlearned model. Leveraging this insight, we propose a new class of methods that achieve state-of-the-art resistance to relearning attacks.

PDF Details

RLJ Journal 2025 Journal Article

Mitigating Goal Misgeneralization via Minimax Regret

Karim Abdel Sadek
Matthew Farrugia-Roberts
Usman Anwar
Hannah Erlebach
Christian Schroeder de Witt
David Krueger
Michael D Dennis

Safe generalization in reinforcement learning requires not only that a learned policy acts capably in new situations, but also that it uses its capabilities towards the pursuit of the designer’s intended goal. The latter requirement may fail when a proxy goal incentivizes similar behavior to the intended goal within the training environment, but not in novel deployment environments. This creates the risk that policies will behave as if in pursuit of the proxy goal, rather than the intended goal, in deployment—a phenomenon known as goal misgeneralization. In this paper, we formalize this problem setting in order to theoretically study the possibility of goal misgeneralization under different training objectives. We show that goal misgeneralization is possible under approximate optimization of the maximum expected value (MEV) objective, but not the minimax expected regret (MMER) objective. We then empirically show that the standard MEV-based training method of domain randomization exhibits goal misgeneralization in procedurally-generated grid-world environments, whereas current regret-based unsupervised environment design (UED) methods are more robust to goal misgeneralization (though they don’t find MMER policies in all cases). Our findings suggest that minimax expected regret is a promising approach to mitigating goal misgeneralization.

PDF Details

RLC Conference 2025 Conference Paper

Mitigating Goal Misgeneralization via Minimax Regret

Karim Abdel Sadek
Matthew Farrugia-Roberts
Usman Anwar
Hannah Erlebach
Christian Schroeder de Witt
David Krueger
Michael D Dennis

PDF Details

TMLR Journal 2025 Journal Article

Permissive Information-Flow Analysis for Large Language Models

Shoaib Ahmed Siddiqui
Radhika Gaonkar
Boris Köpf
David Krueger
Andrew Paverd
Ahmed Salem
Shruti Tople
Lukas Wutschitz

Large Language Models (LLMs) are rapidly becoming commodity components of larger software systems. This poses natural security and privacy problems: poisoned data retrieved from one component can change the model's behavior and compromise the entire system, including coercing the model to spread confidential data to untrusted components. Assuming each piece of information comes with an additional meta-label (such as low/high integrity labels), one promising approach is to tackle this problem at the system level via dynamic information flow (aka taint) tracking. Unfortunately, this approach of propagating the most restrictive input label to the output is too conservative for applications where LLMs operate on inputs retrieved from diverse sources. In this paper, we propose a novel, more permissive approach to propagate information flow labels through LLM queries. The key idea behind our approach is to propagate only the input labels that were \emph{influential} in generating the model output and to eliminate the labels of unnecessary inputs. We implement and investigate the effectiveness of two variations of this approach, based on (i) prompt-based retrieval augmentation, and (ii) a $k$-nearest-neighbors language model. We compare these with a baseline that uses introspection to predict the output label. Our experimental results in an LLM agent setting show that our label propagator assigns a more permissive label over the baseline in more than 85% of the cases, which underscores the practicality of our approach.

PDF Details

TMLR Journal 2025 Journal Article

Taxonomy, Opportunities, and Challenges of Representation Engineering for Large Language Models

Jan Wehner
Sahar Abdelnabi
Daniel Tan
David Krueger
Mario Fritz

Representation Engineering (RepE) is a novel paradigm for controlling the behavior of LLMs. Unlike traditional approaches that modify inputs or fine-tune the model, RepE directly manipulates the model's internal representations. As a result, it may offer more effective, interpretable, data-efficient, and flexible control over models' behavior. We present the first comprehensive survey of RepE for LLMs, reviewing the rapidly growing literature to address key questions: What RepE methods exist and how do they differ? For what concepts and problems has RepE been applied? What are the strengths and weaknesses of RepE compared to other methods? To answer these, we propose a unified framework describing RepE as a pipeline comprising representation identification, operationalization, and control. We posit that while RepE methods offer significant potential, challenges remain, including managing multiple concepts, ensuring reliability, and preserving models' performance. Towards improving RepE, we identify opportunities for experimental and methodological improvements and construct a guide for best practices.

PDF Details

TMLR Journal 2025 Journal Article

Understanding In-Context Learning of Linear Models in Transformers Through an Adversarial Lens

Usman Anwar
Johannes von Oswald
Louis Kirsch
David Krueger
Spencer Frei

In this work, we make two contributions towards understanding of in-context learning of linear models by transformers. First, we investigate the adversarial robustness of in-context learning in transformers to hijacking attacks — a type of adversarial attacks in which the adversary’s goal is to manipulate the prompt to force the transformer to generate a specific output. We show that both linear transformers and transformers with GPT-2 architectures are vulnerable to such hijacking attacks. However, adversarial robustness to such attacks can be significantly improved through adversarial training --- done either at the pretraining or finetuning stage --- and can generalize to stronger attack models. Our second main contribution is a comparative analysis of adversarial vulnerabilities across transformer models and other algorithms for learning linear models. This reveals two novel findings. First, adversarial attacks transfer poorly between larger transformer models trained from different seeds despite achieving similar in-distribution performance. This suggests that transformers of the same architecture trained according to the same recipe may implement different in-context learning algorithms for the same task. Second, we observe that attacks do not transfer well between classical learning algorithms for linear models (single-step gradient descent and ordinary least squares) and transformers. This suggests that there could be qualitative differences between the in-context learning algorithms that transformers implement and these traditional algorithms.

PDF Details

NeurIPS Conference 2024 Conference Paper

A Generative Model of Symmetry Transformations

James U. Allingham
Bruno K. Mlodozeniec
Shreyas Padhy
Javier Antorán
David Krueger
Richard E. Turner
Eric Nalisnick
José M. Hernández-Lobato

Correctly capturing the symmetry transformations of data can lead to efficient models with strong generalization capabilities, though methods incorporating symmetries often require prior knowledge. While recent advancements have been made in learning those symmetries directly from the dataset, most of this work has focused on the discriminative setting. In this paper, we take inspiration from group theoretic ideas to construct a generative model that explicitly aims to capture the data's approximate symmetries. This results in a model that, given a prespecified broad set of possible symmetries, learns to what extent, if at all, those symmetries are actually present. Our model can be seen as a generative process for data augmentation. We provide a simple algorithm for learning our generative model and empirically demonstrate its ability to capture symmetries under affine and color transformations, in an interpretable way. Combining our symmetry model with standard generative models results in higher marginal test-log-likelihoods and improved data efficiency.

PDF Details DOI

TMLR Journal 2024 Journal Article

Blockwise Self-Supervised Learning at Scale

Shoaib Siddiqui
David Krueger
Yann LeCun
Stephane Deny

Current state-of-the-art deep networks are all powered by backpropagation. However, long backpropagation paths as found in end-to-end training are biologically implausible, as well as inefficient in terms of energy consumption. In this paper, we explore alternatives to full backpropagation in the form of blockwise learning rules, leveraging the latest developments in self-supervised learning. We show that a blockwise pretraining procedure consisting of training independently the 4 main blocks of layers of a ResNet-50 with Barlow Twins' loss function at each block performs almost as well as end-to-end backpropagation on ImageNet: a linear probe trained on top of our blockwise pretrained model obtains a top-1 classification accuracy of 70.48\%, only 1.1\% below the accuracy of an end-to-end pretrained network (71.57\% accuracy). We perform extensive experiments to understand the impact of different components within our method and explore a variety of adaptations of self-supervised learning to the blockwise paradigm, building an exhaustive understanding of the critical avenues for scaling local learning rules to large networks, with implications ranging from hardware design to neuroscience.

PDF Details

TMLR Journal 2024 Journal Article

Foundational Challenges in Assuring Alignment and Safety of Large Language Models

Usman Anwar
Abulhair Saparov
Javier Rando
Daniel Paleka
Miles Turpin
Peter Hase
Ekdeep Singh Lubana
Erik Jenner

This work identifies 18 foundational challenges in assuring the alignment and safety of large language models (LLMs). These challenges are organized into three different categories: scientific understanding of LLMs, development and deployment methods, and sociotechnical challenges. Based on the identified challenges, we pose 200+, concrete research questions.

PDF Details

NeurIPS Conference 2024 Conference Paper

Interpreting Learned Feedback Patterns in Large Language Models

Luke Marks
Amir Abdullah
Clement Neo
Rauno Arike
David Krueger
Philip Torr
Fazl Barez

Reinforcement learning from human feedback (RLHF) is widely used to train large language models (LLMs). However, it is unclear whether LLMs accurately learn the underlying preferences in human feedback data. We coin the term Learned Feedback Pattern (LFP) for patterns in an LLM's activations learned during RLHF that improve its performance on the fine-tuning task. We hypothesize that LLMs with LFPs accurately aligned to the fine-tuning feedback exhibit consistent activation patterns for outputs that would have received similar feedback during RLHF. To test this, we train probes to estimate the feedback signal implicit in the activations of a fine-tuned LLM. We then compare these estimates to the true feedback, measuring how accurate the LFPs are to the fine-tuning feedback. Our probes are trained on a condensed, sparse and interpretable representation of LLM activations, making it easier to correlate features of the input with our probe's predictions. We validate our probes by comparing the neural features they correlate with positive feedback inputs against the features GPT-4 describes and classifies as related to LFPs. Understanding LFPs can help minimize discrepancies between LLM behavior and training objectives, which is essential for the safety and alignment of LLMs.

PDF Details DOI

NeurIPS Conference 2024 Conference Paper

Predicting Future Actions of Reinforcement Learning Agents

Stephen Chung
Scott Niekum
David Krueger

As reinforcement learning agents become increasingly deployed in real-world scenarios, predicting future agent actions and events during deployment is important for facilitating better human-agent interaction and preventing catastrophic outcomes. This paper experimentally evaluates and compares the effectiveness of future action and event prediction for three types of RL agents: explicitly planning, implicitly planning, and non-planning. We employ two approaches: the inner state approach, which involves predicting based on the inner computations of the agents (e. g. , plans or neuron activations), and a simulation-based approach, which involves unrolling the agent in a learned world model. Our results show that the plans of explicitly planning agents are significantly more informative for prediction than the neuron activations of the other types. Furthermore, using internal plans proves more robust to model quality compared to simulation-based approaches when predicting actions, while the results for event prediction are more mixed. These findings highlight the benefits of leveraging inner states and simulations to predict future agent actions and events, thereby improving interaction and safety in real-world deployments.

PDF Details DOI

NeurIPS Conference 2024 Conference Paper

Stress-Testing Capability Elicitation With Password-Locked Models

Ryan Greenblatt
Fabien Roger
Dmitrii Krasheninnikov
David Krueger

To determine the safety of large language models (LLMs), AI developers must be able to assess their dangerous capabilities. But simple prompting strategies often fail to elicit an LLM’s full capabilities. One way to elicit capabilities more robustly is to fine-tune the LLM to complete the task. In this paper, we investigate the conditions under which fine-tuning-based elicitation suffices to elicit capabilities. To do this, we introduce password-locked models, LLMs fine-tuned such that some of their capabilities are deliberately hidden. Specifically, these LLMs are trained to exhibit these capabilities only when a password is present in the prompt, and to imitate a much weaker LLM otherwise. Password-locked models enable a novel method of evaluating capabilities elicitation methods, by testing whether these password-locked capabilities can be elicited without using the password. We find that a few high-quality demonstrations are often sufficient to fully elicit password-locked capabilities. More surprisingly, fine-tuning can elicit other capabilities that have been locked using the same password, or even different passwords. Furthermore, when only evaluations, and not demonstrations, are available, approaches like reinforcement learning are still often able to elicit capabilities. Overall, our findings suggest that fine-tuning is an effective method of eliciting hidden capabilities of current models but may be unreliable when high-quality demonstrations are not available, e. g. , as may be the case when models’ (hidden) capabilities exceed those of human demonstrators.

PDF Details DOI

TMLR Journal 2023 Journal Article

Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback

Stephen Casper
Xander Davies
Claudia Shi
Thomas Krendl Gilbert
Jérémy Scheurer
Javier Rando
Rachel Freedman
Tomek Korbak

Reinforcement learning from human feedback (RLHF) is a technique for training AI systems to align with human goals. RLHF has emerged as the central method used to finetune state-of-the-art large language models (LLMs). Despite this popularity, there has been relatively little public work systematizing its flaws. In this paper, we (1) survey open problems and fundamental limitations of RLHF and related methods; (2) overview techniques to understand, improve, and complement RLHF in practice; and (3) propose auditing and disclosure standards to improve societal oversight of RLHF systems. Our work emphasizes the limitations of RLHF and highlights the importance of a multi-layered approach to the development of safer AI systems.

PDF Details

NeurIPS Conference 2023 Conference Paper

Thinker: Learning to Plan and Act

Stephen Chung
Ivan Anokhin
David Krueger

We propose the Thinker algorithm, a novel approach that enables reinforcement learning agents to autonomously interact with and utilize a learned world model. The Thinker algorithm wraps the environment with a world model and introduces new actions designed for interacting with the world model. These model-interaction actions enable agents to perform planning by proposing alternative plans to the world model before selecting a final action to execute in the environment. This approach eliminates the need for handcrafted planning algorithms by enabling the agent to learn how to plan autonomously and allows for easy interpretation of the agent's plan with visualization. We demonstrate the algorithm's effectiveness through experimental results in the game of Sokoban and the Atari 2600 benchmark, where the Thinker algorithm achieves state-of-the-art performance and competitive results, respectively. Visualizations of agents trained with the Thinker algorithm demonstrate that they have learned to plan effectively with the world model to select better actions. Thinker is the first work showing that an RL agent can learn to plan with a learned world model in complex environments.

PDF Details

NeurIPS Conference 2022 Conference Paper

Defining and Characterizing Reward Gaming

Joar Skalse
Nikolaus Howe
Dmitrii Krasheninnikov
David Krueger

We provide the first formal definition of \textbf{reward hacking}, a phenomenon where optimizing an imperfect proxy reward function, $\mathcal{\tilde{R}}$, leads to poor performance according to the true reward function, $\mathcal{R}$. We say that a proxy is \textbf{unhackable} if increasing the expected proxy return can never decrease the expected true return. Intuitively, it might be possible to create an unhackable proxy by leaving some terms out of the reward function (making it ``narrower'') or overlooking fine-grained distinctions between roughly equivalent outcomes, but we show this is usually not the case. A key insight is that the linearity of reward (in state-action visit counts) makes unhackability a very strong condition. In particular, for the set of all stochastic policies, two reward functions can only be unhackable if one of them is constant. We thus turn our attention to deterministic policies and finite sets of stochastic policies, where non-trivial unhackable pairs always exist, and establish necessary and sufficient conditions for the existence of simplifications, an important special case of unhackability. Our results reveal a tension between using reward functions to specify narrow tasks and aligning AI systems with human values.

PDF Details

RLDM Conference 2019 Conference Abstract

MISLEADING META-OBJECTIVES AND HIDDEN INCENTIVES FOR DIS- TRIBUTIONAL SHIFT

David Krueger
Tegan Maharaj
Shane Legg
Jan Leike

Decisions made by machine learning systems have a tremendous influence on the world. Yet it is common for machine learning algorithms to assume that no such influence exists. An example is the use of the i. i. d. assumption in online learning for applications such as content recommendation, where the (choice of) content displayed can change users’ perceptions and preferences, or even drive them away, causing a shift in the distribution of users. A large body of work in reinforcement learning and causal machine learning aims to account for distributional shift caused by deploying a learning system previously trained offline. Our goal is similar, but distinct: we point out that online training with meta-learning can create a hidden incentive for a learner to cause distributional shift. We design a simple environment to test for these hidden incentives (HIDS), demonstrate the potential for this phenomenon to cause unexpected or undesirable behavior, and propose and validate a mitigation strategy.

PDF Details