Arrow Research search

Author name cluster

Tomasz Korbak

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

12 papers
2 author rows

Possible papers

12

ICLR Conference 2024 Conference Paper

Compositional Preference Models for Aligning LMs

  • Dongyoung Go
  • Tomasz Korbak
  • Germán Kruszewski
  • Jos Rozen
  • Marc Dymetman

As language models (LMs) become more capable, it is increasingly important to align them with human preferences. However, the dominant paradigm for training Preference Models (PMs) for that purpose suffers from fundamental limitations, such as lack of transparency and scalability, along with susceptibility to overfitting the preference dataset. We propose Compositional Preference Models (CPMs), a novel PM framework that decomposes one global preference assessment into several interpretable features, obtains scalar scores for these features from a prompted LM, and aggregates these scores using a logistic regression classifier. Through these simple steps, CPMs allow to control which properties of the preference data are used to train the preference model and to build it based on features that are believed to underlie the human preference judgment. Our experiments show that CPMs not only improve generalization and are more robust to overoptimization than standard PMs, but also that best-of-n samples obtained using CPMs tend to be preferred over samples obtained using conventional PMs. Overall, our approach demonstrates the benefits of endowing PMs with priors about which features determine human preferences while relying on LM capabilities to extract those features in a scalable and robust way.

TMLR Journal 2024 Journal Article

Foundational Challenges in Assuring Alignment and Safety of Large Language Models

  • Usman Anwar
  • Abulhair Saparov
  • Javier Rando
  • Daniel Paleka
  • Miles Turpin
  • Peter Hase
  • Ekdeep Singh Lubana
  • Erik Jenner

This work identifies 18 foundational challenges in assuring the alignment and safety of large language models (LLMs). These challenges are organized into three different categories: scientific understanding of LLMs, development and deployment methods, and sociotechnical challenges. Based on the identified challenges, we pose 200+, concrete research questions.

TMLR Journal 2024 Journal Article

Learning from Natural Language Feedback

  • Angelica Chen
  • Jérémy Scheurer
  • Jon Ander Campos
  • Tomasz Korbak
  • Jun Shern Chan
  • Samuel R. Bowman
  • Kyunghyun Cho
  • Ethan Perez

The potential for pre-trained large language models (LLMs) to use natural language feedback at inference time has been an exciting recent development. We build upon this observation by formalizing an algorithm for learning from natural language feedback at training time instead, which we call Imitation learning from Language Feedback (ILF). ILF requires only a small amount of human-written feedback during training and does not require the same feedback at test time, making it both user-friendly and sample-efficient. We further show that ILF can be seen as a form of minimizing the KL divergence to the target distribution and demonstrate proof-of-concepts on text summarization and program synthesis tasks. For code generation, ILF improves a Codegen-Mono 6.1B model's pass@1 rate by 38% relative (and 10% absolute) on the Mostly Basic Python Problems (MBPP) benchmark, outperforming both fine-tuning on MBPP and fine-tuning on repaired programs written by humans. For summarization, we show that ILF can be combined with learning from human preferences to improve a GPT-3 model's summarization performance to be comparable to human quality, outperforming fine-tuning on human-written summaries. Overall, our results suggest that learning from human-written natural language feedback is both more effective and sample-efficient than training exclusively on demonstrations for improving an LLM's performance on a variety of tasks.

NeurIPS Conference 2024 Conference Paper

Many-shot Jailbreaking

  • Cem Anil
  • Esin Durmus
  • Nina Panickssery
  • Mrinank Sharma
  • Joe Benton
  • Sandipan Kundu
  • Joshua Batson
  • Meg Tong

We investigate a family of simple long-context attacks on large language models: prompting with hundreds of demonstrations of undesirable behavior. This attack is newly feasible with the larger context windows recently deployed by language model providers like Google DeepMind, OpenAI and Anthropic. We find that in diverse, realistic circumstances, the effectiveness of this attack follows a power law, up to hundreds of shots. We demonstrate the success of this attack on the most widely used state-of-the-art closed-weight models, and across various tasks. Our results suggest very long contexts present a rich new attack surface for LLMs.

ICLR Conference 2024 Conference Paper

The Reversal Curse: LLMs trained on "A is B" fail to learn "B is A"

  • Lukas Berglund
  • Meg Tong
  • Maximilian Kaufmann
  • Mikita Balesni
  • Asa Cooper Stickland
  • Tomasz Korbak
  • Owain Evans

We expose a surprising failure of generalization in auto-regressive large language models (LLMs). If a model is trained on a sentence of the form ''_A_ is _B_'', it will not automatically generalize to the reverse direction ''_B_ is _A_''. This is the **Reversal Curse**. For instance, if a model is trained on ''Valentina Tereshkova was the first woman to travel to space'', it will not automatically be able to answer the question, ''Who was the first woman to travel to space?''. Moreover, the likelihood of the correct answer (''Valentina Tershkova'') will not be higher than for a random name. Thus, models do not generalize a prevalent pattern in their training set: if ''_A_ is _B_'' occurs, ''_B_ is _A_'' is more likely to occur. It is worth noting, however, that if ''_A_ is _B_'' appears _in-context_, models can deduce the reverse relationship. We provide evidence for the Reversal Curse by finetuning GPT-3 and Llama-1 on fictitious statements such as ''Uriah Hawthorne is the composer of _Abyssal Melodies_'' and showing that they fail to correctly answer ''Who composed _Abyssal Melodies?_''. The Reversal Curse is robust across model sizes and model families and is not alleviated by data augmentation. We also evaluate ChatGPT (GPT-3.5 and GPT-4) on questions about real-world celebrities, such as ''Who is Tom Cruise's mother? [A: Mary Lee Pfeiffer]'' and the reverse ''Who is Mary Lee Pfeiffer's son?''. GPT-4 correctly answers questions like the former 79\% of the time, compared to 33\% for the latter. Code available at: https://github.com/lukasberglund/reversal_curse.

ICLR Conference 2024 Conference Paper

Towards Understanding Sycophancy in Language Models

  • Mrinank Sharma
  • Meg Tong
  • Tomasz Korbak
  • David Duvenaud
  • Amanda Askell
  • Samuel R. Bowman
  • Esin Durmus
  • Zac Hatfield-Dodds

Reinforcement learning from human feedback (RLHF) is a popular technique for training high-quality AI assistants. However, RLHF may also encourage model responses that match user beliefs over truthful responses, a behavior known as sycophancy. We investigate the prevalence of sycophancy in RLHF-trained models and whether human preference judgments are responsible. We first demonstrate that five state-of-the-art AI assistants consistently exhibit sycophancy behavior across four varied free-form text-generation tasks. To understand if human preferences drive this broadly observed behavior of RLHF models, we analyze existing human preference data. We find that when a response matches a user's views, it is more likely to be preferred. Moreover, both humans and preference models (PMs) prefer convincingly-written sycophantic responses over correct ones a non-negligible fraction of the time. Optimizing model outputs against PMs also sometimes sacrifices truthfulness in favor of sycophancy. Overall, our results indicate that sycophancy is a general behavior of RLHF models, likely driven in part by human preference judgments favoring sycophantic responses.

ICML Conference 2023 Conference Paper

Aligning Language Models with Preferences through f-divergence Minimization

  • Dongyoung Go
  • Tomasz Korbak
  • Germán Kruszewski
  • Jos Rozen
  • Nahyeon Ryu
  • Marc Dymetman

Aligning language models with preferences can be posed as approximating a target distribution representing some desired behavior. Existing approaches differ both in the functional form of the target distribution and the algorithm used to approximate it. For instance, Reinforcement Learning from Human Feedback (RLHF) corresponds to minimizing a reverse KL from an implicit target distribution arising from a KL penalty in the objective. On the other hand, Generative Distributional Control (GDC) has an explicit target distribution and minimizes a forward KL from it using the Distributional Policy Gradient (DPG) algorithm. In this paper, we propose a new approach, $f$-DPG, which allows the use of any $f$-divergence to approximate any target distribution that can be evaluated. $f$-DPG unifies both frameworks (RLHF, GDC) and the approximation methods (DPG, RL with KL penalties). We show the practical benefits of various choices of divergence objectives and demonstrate that there is no universally optimal objective but that different divergences present different alignment and diversity trade-offs. We show that Jensen-Shannon divergence strikes a good balance between these objectives, and frequently outperforms forward KL divergence by a wide margin, leading to significant improvements over prior work. These distinguishing characteristics between divergences persist as the model size increases, highlighting the importance of selecting appropriate divergence objectives.

TMLR Journal 2023 Journal Article

Inverse Scaling: When Bigger Isn't Better

  • Ian R. McKenzie
  • Alexander Lyzhov
  • Michael Martin Pieler
  • Alicia Parrish
  • Aaron Mueller
  • Ameya Prabhu
  • Euan McLean
  • Xudong Shen

Work on scaling laws has found that large language models (LMs) show predictable improvements to overall loss with increased scale (model size, training data, and compute). Here, we present evidence for the claim that LMs may show inverse scaling, or worse task performance with increased scale, e.g., due to flaws in the training objective and data. We present empirical evidence of inverse scaling on 11 datasets collected by running a public contest, the Inverse Scaling Prize, with a substantial prize pool. Through analysis of the datasets, along with other examples found in the literature, we identify four potential causes of inverse scaling: (i) preference to repeat memorized sequences over following in-context instructions, (ii) imitation of undesirable patterns in the training data, (iii) tasks containing an easy distractor task which LMs could focus on, rather than the harder real task, and (iv) correct but misleading few-shot demonstrations of the task. We release the winning datasets at inversescaling.com/data to allow for further investigation of inverse scaling. Our tasks have helped drive the discovery of U-shaped and inverted-U scaling trends, where an initial trend reverses, suggesting that scaling trends are less reliable at predicting the behavior of larger-scale models than previously understood. Overall, our results suggest that there are tasks for which increased model scale alone may not lead to progress, and that more careful thought needs to go into the data and objectives for training language models.

ICML Conference 2023 Conference Paper

Pretraining Language Models with Human Preferences

  • Tomasz Korbak
  • Kejian Shi
  • Angelica Chen
  • Rasika Vinayak Bhalerao
  • Christopher L. Buckley
  • Jason Phang
  • Samuel R. Bowman
  • Ethan Perez

Language models (LMs) are pretrained to imitate text from large and diverse datasets that contain content that would violate human preferences if generated by an LM: falsehoods, offensive comments, personally identifiable information, low-quality or buggy code, among others. Here, we explore alternative objectives for pretraining LMs in a way that also guides them to generate text aligned with human preferences. We benchmark five objectives for pretraining with human feedback across three tasks and study how they affect the alignment and capabilities of pretrained LMs. We find a Pareto-optimal and simple approach among those we explored: conditional training, or learning distribution over tokens conditional on their human preference scores. Conditional training reduces the rate of undesirable content by up to an order of magnitude, both when generating without a prompt and with an adversarially-chosen prompt. Moreover, conditional training maintains the downstream task performance of standard LM pretraining, both before and after task-specific finetuning. Pretraining with human feedback results in much better preference satisfaction than standard LM pretraining followed by finetuning with feedback, i. e. , learning and then unlearning undesirable behavior. Our results suggest that we should move beyond imitation learning when pretraining LMs and incorporate human preferences from the start of training.

ICML Conference 2022 Conference Paper

Controlling Conditional Language Models without Catastrophic Forgetting

  • Tomasz Korbak
  • Hady Elsahar
  • Germán Kruszewski
  • Marc Dymetman

Machine learning is shifting towards general-purpose pretrained generative models, trained in a self-supervised manner on large amounts of data, which can then be applied to solve a large number of tasks. However, due to their generic training methodology, these models often fail to meet some of the downstream requirements (e. g. , hallucinations in abstractive summarization or style violations in code generation). This raises the important question of how to adapt pre-trained generative models to meet all requirements without destroying their general capabilities ("catastrophic forgetting"). Recent work has proposed to solve this problem by representing task-specific requirements through energy-based models (EBMs) and approximating these EBMs using distributional policy gradients (DPG). Despite its effectiveness, this approach is however limited to unconditional distributions. In this paper, we extend DPG to conditional tasks by proposing Conditional DPG (CDPG). We evaluate CDPG on four different control objectives across three tasks (translation, summarization and code generation) and two pretrained models (T5 and GPT-Neo). Our results show that fine-tuning using CDPG robustly moves these pretrained models closer towards meeting control objectives and — in contrast with baseline approaches — does not result in catastrophic forgetting.

NeurIPS Conference 2022 Conference Paper

On Reinforcement Learning and Distribution Matching for Fine-Tuning Language Models with no Catastrophic Forgetting

  • Tomasz Korbak
  • Hady Elsahar
  • Germán Kruszewski
  • Marc Dymetman

The availability of large pre-trained models is changing the landscape of Machine Learning research and practice, moving from a "training from scratch" to a "fine-tuning'' paradigm. While in some applications the goal is to "nudge'' the pre-trained distribution towards preferred outputs, in others it is to steer it towards a different distribution over the sample space. Two main paradigms have emerged to tackle this challenge: Reward Maximization (RM) and, more recently, Distribution Matching (DM). RM applies standard Reinforcement Learning (RL) techniques, such as Policy Gradients, to gradually increase the reward signal. DM prescribes to first make explicit the target distribution that the model is fine-tuned to approximate. Here we explore the theoretical connections between the two paradigms and show that methods such as KL-control developed in the RM paradigm can also be construed as belonging to DM. We further observe that while DM differs from RM, it can suffer from similar training difficulties, such as high gradient variance. We leverage connections between the two paradigms to import the concept of baseline into DM methods. We empirically validate the benefits of adding a baseline on an array of controllable language generation tasks such as constraining topic, sentiment, and gender distributions in texts sampled from a language model. We observe superior performance in terms of constraint satisfaction, stability, and sample efficiency.

NeurIPS Conference 2021 Conference Paper

Catalytic Role Of Noise And Necessity Of Inductive Biases In The Emergence Of Compositional Communication

  • Łukasz Kuciński
  • Tomasz Korbak
  • Paweł Kołodziej
  • Piotr Miłoś

Communication is compositional if complex signals can be represented as a combination of simpler subparts. In this paper, we theoretically show that inductive biases on both the training framework and the data are needed to develop a compositional communication. Moreover, we prove that compositionality spontaneously arises in the signaling games, where agents communicate over a noisy channel. We experimentally confirm that a range of noise levels, which depends on the model and the data, indeed promotes compositionality. Finally, we provide a comprehensive study of this dependence and report results in terms of recently studied compositionality metrics: topographical similarity, conflict count, and context independence.