Author name cluster

Ethan Perez

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

20 papers
2 author rows

Possible papers

ICLR Conference 2025 Conference Paper

Adaptive Deployment of Untrusted LLMs Reduces Distributed Threats

  • Jiaxin Wen
  • Vivek Hebbar
  • Caleb Larson
  • Aryan Bhatt
  • Ansh Radhakrishnan
  • Mrinank Sharma
  • Henry Sleight
  • Shi Feng

As large language models (LLMs) grow more powerful, they also become more difficult to trust. They could be either aligned with human intentions, or exhibit "subversive misalignment" -- introducing subtle errors that bypass safety checks. Although individual errors may not immediately cause harm, each increases the risk of an eventual safety failure. With this uncertainty, model deployment often grapples with the tradeoff between ensuring safety and harnessing the capabilities of untrusted models. In this work, we introduce the "Diffuse Risk Management" problem, aiming to balance the average-case safety and usefulness in the deployment of untrusted models over a large sequence of tasks. We approach this problem by developing a two-level framework: the single-task level (micro-protocol) and the whole-scenario level (macro-protocol). At the single-task level, we develop various micro-protocols that use a less capable, but extensively tested (trusted) model to harness and monitor the untrusted model. At the whole-scenario level, we find an optimal macro-protocol that uses an adaptive estimate of the untrusted model's risk to choose between micro-protocols. To evaluate the robustness of our method, we follow control evaluations in a code generation testbed, which involves a red team attempting to generate subtly backdoored code with an LLM whose deployment is safeguarded by a blue team. Experiment results show that our approach retains 99.6% usefulness of the untrusted model while ensuring near-perfect safety, significantly outperforming existing deployment methods. Our approach also demonstrates robustness when the trusted and untrusted models have a large capability gap. Our findings demonstrate the promise of managing diffuse risks in the deployment of increasingly capable but untrusted LLMs.

NeurIPS Conference 2025 Conference Paper

Best-of-N Jailbreaking

  • John Hughes
  • Sara Price
  • Aengus Lynch
  • Rylan Schaeffer
  • Fazl Barez
  • Arushi Somani
  • Sanmi Koyejo
  • Henry Sleight

We introduce Best-of-N (BoN) Jailbreaking, a simple black-box algorithm that jailbreaks frontier AI systems across modalities. BoN Jailbreaking works by repeatedly sampling variations of a prompt with a combination of augmentations (such as random shuffling or capitalization for textual prompts) until a harmful response is elicited. We find that BoN Jailbreaking achieves high attack success rates (ASRs) on closed-source language models, such as 89% on GPT-4o and 78% on Claude 3.5 Sonnet when sampling 10,000 augmented prompts. Further, it is similarly effective at circumventing state-of-the-art open-source defenses like circuit breakers and reasoning models like o1. BoN also seamlessly extends to other modalities: it jailbreaks vision language models (VLMs) such as GPT-4o and audio language models (ALMs) like Gemini 1.5 Pro, using modality-specific augmentations. BoN reliably improves when we sample more augmented prompts. Across all modalities, ASR, as a function of the number of samples (N), empirically follows power-law-like behavior for many orders of magnitude. BoN Jailbreaking can also be composed with other black-box algorithms for even more effective attacks: combining BoN with an optimized prefix attack achieves up to a 35% increase in ASR. Overall, our work indicates that, despite their capability, language models are sensitive to seemingly innocuous changes to inputs, which attackers can exploit across modalities.

ICLR Conference 2025 Conference Paper

Failures to Find Transferable Image Jailbreaks Between Vision-Language Models

  • Rylan Schaeffer
  • Dan Valentine
  • Luke Bailey
  • James Chua
  • Cristóbal Eyzaguirre
  • Zane Durante
  • Joe Benton
  • Brando Miranda

The integration of new modalities into frontier AI systems offers exciting capabilities, but also increases the possibility such systems can be adversarially manipulated in undesirable ways. In this work, we focus on a popular class of vision-language models (VLMs) that generate text outputs conditioned on visual and textual inputs. We conducted a large-scale empirical study to assess the transferability of gradient-based universal image "jailbreaks" using a diverse set of over 40 open-parameter VLMs, including 18 new VLMs that we publicly release. Overall, we find that transferable gradient-based image jailbreaks are extremely difficult to obtain. When an image jailbreak is optimized against a single VLM or against an ensemble of VLMs, the jailbreak successfully jailbreaks the attacked VLM(s), but exhibits little-to-no transfer to any other VLMs; transfer is not affected by whether the attacked and target VLMs possess matching vision backbones or language models, whether the language model underwent instruction-following and/or safety-alignment training, or many other factors. Only two settings display partially successful transfer: between identically-pretrained and identically-initialized VLMs with slightly different VLM training data, and between different training checkpoints of a single VLM. Leveraging these results, we then demonstrate that transfer can be significantly improved against a specific target VLM by attacking larger ensembles of "highly-similar" VLMs. These results stand in stark contrast to existing evidence of universal and transferable text jailbreaks against language models and transferable adversarial attacks against image classifiers, suggesting that VLMs may be more robust to gradient-based transfer attacks.

TMLR Journal 2025 Journal Article

Inverse Scaling in Test-Time Compute

  • Aryo Pradipta Gema
  • Alexander Hägele
  • Runjin Chen
  • Andy Arditi
  • Jacob Goldman-Wetzler
  • Kit Fraser-Taliente
  • Henry Sleight
  • Linda Petrini

We construct evaluation tasks where extending the reasoning length of Large Reasoning Models (LRMs) deteriorates performance, exhibiting an inverse scaling relationship between test-time compute and accuracy. Our evaluation tasks span four categories: simple counting tasks with distractors, regression tasks with spurious features, deduction tasks with constraint tracking, and advanced AI risks. We identify five distinct failure modes when models reason for longer: 1) Claude models become increasingly distracted by irrelevant information; 2) OpenAI o-series models resist distractors but overfit to problem framings; 3) models shift from reasonable priors to spurious correlations; 4) all models show difficulties in maintaining focus on complex deductive tasks; and 5) extended reasoning may amplify concerning behaviors, with Claude Sonnet 4 showing increased expressions of self-preservation. These findings suggest that while test-time compute scaling remains promising for improving model capabilities, it may inadvertently reinforce problematic reasoning patterns. Our results demonstrate the importance of evaluating models across diverse reasoning lengths to identify and address these failure modes in LRMs.

ICLR Conference 2025 Conference Paper

Language Models Learn to Mislead Humans via RLHF

  • Jiaxin Wen
  • Ruiqi Zhong
  • Akbir Khan
  • Ethan Perez
  • Jacob Steinhardt
  • Minlie Huang
  • Samuel R. Bowman
  • He He

Language models (LMs) can produce errors that are hard to detect for humans, especially when the task is complex. RLHF, the most popular post-training method, may exacerbate this problem: to achieve higher rewards, LMs might get better at convincing humans that they are right even when they are wrong. We study this phenomenon under a standard RLHF pipeline, calling it "U-Sophistry" since it is Unintended by model developers. Specifically, we ask time-constrained (e.g., 3-10 minutes) human subjects to evaluate the correctness of model outputs and calculate humans' accuracy against gold labels. On a question-answering task (QuALITY) and programming task (APPS), RLHF makes LMs better at convincing our subjects but not at completing the task correctly. RLHF also makes the model harder to evaluate: our subjects' false positive rate increases by 24.1% on QuALITY and 18.3% on APPS. Finally, we show that probing, a state-of-the-art approach for detecting Intended Sophistry (e.g., backdoored LMs), does not generalize to U-Sophistry. Our results highlight an important failure mode of RLHF and call for more research in assisting humans to align them.

TMLR Journal 2025 Journal Article

Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs

  • Abhay Sheshadri
  • Aidan Ewart
  • Phillip Huang Guo
  • Aengus Lynch
  • Cindy Wu
  • Vivek Hebbar
  • Henry Sleight
  • Asa Cooper Stickland

Large language models (LLMs) can often be made to behave in undesirable ways that they are explicitly fine-tuned not to. For example, the LLM red-teaming literature has produced a wide variety of 'jailbreaking' techniques to elicit harmful text from models that were fine-tuned to be harmless. Recent work on red-teaming, model editing, and interpretability suggests that this challenge stems from how (adversarial) fine-tuning largely serves to suppress rather than remove undesirable capabilities from LLMs. Prior work has introduced latent adversarial training (LAT) as a way to improve robustness to broad classes of failures. These prior works have considered untargeted latent space attacks where the adversary perturbs latent activations to maximize loss on examples of desirable behavior. Untargeted LAT can provide a generic type of robustness but does not leverage information about specific failure modes. Here, we experiment with targeted LAT where the adversary seeks to minimize loss on a specific competing task. We find that it can augment a wide variety of state-of-the-art methods. First, we use targeted LAT to improve robustness to jailbreaks, outperforming a strong R2D2 baseline with orders of magnitude less compute. Second, we use it to more effectively remove backdoors with no knowledge of the trigger. Finally, we use it to more effectively unlearn knowledge for specific undesirable tasks in a way that is also more robust to re-learning. Overall, our results suggest that targeted LAT can be an effective tool for defending against harmful behaviors from LLMs.

ICLR Conference 2025 Conference Paper

Looking Inward: Language Models Can Learn About Themselves by Introspection

  • Felix Jedidja Binder
  • James Chua
  • Tomek Korbak
  • Henry Sleight
  • John Hughes
  • Robert Long
  • Ethan Perez
  • Miles Turpin

Humans acquire knowledge by observing the external world, but also by introspection. Introspection gives a person privileged access to their current state of mind (e.g. thoughts and feelings) that are not accessible to external observers. Do LLMs have this introspective capability of privileged access? If they do, this would show that LLMs can acquire knowledge not contained in or inferable from training data. We investigate LLMs predicting properties of their own behavior in hypothetical situations. If a model M1 has this capability, it should outperform a different model M2 in predicting M1's behavior—even if M2 is trained on M1's ground-truth behavior. The idea is that M1 has privileged access to its own behavioral tendencies, and this enables it to predict itself better than M2 (even if M2 is generally stronger). In experiments with GPT-4, GPT-4o, and Llama-3 models, we find that the model M1 outperforms M2 in predicting itself, providing evidence for privileged access. Further experiments and ablations provide additional evidence. Our results show that LLMs can offer reliable self-information independent of external data in certain domains. By demonstrating this, we pave the way for further work on introspection in more practical domains, which would have significant implications for model transparency and explainability. However, while we successfully show introspective capabilities in simple tasks, we are unsuccessful on more complex tasks or those requiring out-of-distribution generalization.

NeurIPS Conference 2025 Conference Paper

Quantifying Elicitation of Latent Capabilities in Language Models

  • Elizabeth Donoway
  • Hailey Joren
  • Arushi Somani
  • Henry Sleight
  • Julian Michael
  • Michael Deweese
  • John Schulman
  • Ethan Perez

Large language models often possess latent capabilities that lie dormant unless explicitly elicited, or surfaced, through fine-tuning or prompt engineering. Predicting, assessing, and understanding these latent capabilities pose significant challenges in the development of effective, safe AI systems. In this work, we recast elicitation as an information-constrained fine-tuning problem and empirically characterize upper bounds on the minimal number of parameters needed to achieve specific task performances. We find that training as few as 10–100 randomly chosen parameters (several orders of magnitude fewer than state-of-the-art parameter-efficient methods) can recover up to 50% of the performance gap between pretrained-only and full fine-tuned models, and 1,000s to 10,000s of parameters can recover 95% of this performance gap. We show that a logistic curve fits the relationship between the number of trained parameters and model performance gap recovery. This scaling generalizes across task formats and domains, as well as model sizes and families, extending to reasoning models and remaining robust to increases in inference compute. To help explain this behavior, we consider a simplified picture of elicitation via fine-tuning where each trainable parameter serves as an encoding mechanism for accessing task-specific knowledge. We observe a relationship between the number of trained parameters and how efficiently relevant model capabilities can be accessed and elicited, offering a potential route to distinguish elicitation from teaching.
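
The gap-recovery quantity described in this abstract can be sketched as follows. This is a minimal illustration, not the paper's code: the function names, the choice of log10 of the parameter count as the logistic input, and the `midpoint_log10` and `slope` values are all assumptions made here for the example.

```python
from math import exp, log10


def gap_recovery(score: float, base_score: float, full_score: float) -> float:
    """Fraction of the gap between the pretrained-only model (base_score)
    and the fully fine-tuned model (full_score) that `score` recovers."""
    if full_score == base_score:
        raise ValueError("base and full scores must differ")
    return (score - base_score) / (full_score - base_score)


def logistic_recovery(num_params: int, midpoint_log10: float, slope: float) -> float:
    """Hypothetical logistic curve over log10(number of trained parameters).

    midpoint_log10 is the log10 parameter count at which half the gap is
    recovered; slope controls how sharply recovery rises. Both are
    illustrative fit parameters, not values from the paper."""
    return 1.0 / (1.0 + exp(-slope * (log10(num_params) - midpoint_log10)))


if __name__ == "__main__":
    # A model scoring 0.6 between a 0.2 baseline and a 1.0 ceiling.
    print(gap_recovery(0.6, base_score=0.2, full_score=1.0))
    # Recovery predicted at 100 trained parameters for an assumed curve.
    print(logistic_recovery(100, midpoint_log10=2.0, slope=1.0))
```

In practice the two curve parameters would be fitted to measured gap-recovery values at several parameter counts.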

ICML Conference 2024 Conference Paper

Debating with More Persuasive LLMs Leads to More Truthful Answers

  • Akbir Khan
  • John Hughes
  • Dan Valentine
  • Laura Ruis
  • Kshitij Sachan
  • Ansh Radhakrishnan
  • Edward Grefenstette
  • Samuel R. Bowman

Common methods for aligning large language models (LLMs) with desired behaviour heavily rely on human-labelled data. However, as models grow increasingly sophisticated, they will surpass human expertise, and the role of human evaluation will evolve into non-experts overseeing experts. In anticipation of this, we ask: can weaker models assess the correctness of stronger models? We investigate this question in an analogous setting, where stronger models (experts) possess the necessary information to answer questions and weaker models (non-experts) lack this information. The method we evaluate is debate, where two LLM experts each argue for a different answer, and a non-expert selects the answer. We find that debate consistently helps both non-expert models and humans answer questions, achieving 76% and 88% accuracy respectively (naive baselines obtain 48% and 60%). Furthermore, optimising expert debaters for persuasiveness in an unsupervised manner improves non-expert ability to identify the truth in debates. Our results provide encouraging empirical evidence for the viability of aligning models with debate in the absence of ground truth.

TMLR Journal 2024 Journal Article

Learning from Natural Language Feedback

  • Angelica Chen
  • Jérémy Scheurer
  • Jon Ander Campos
  • Tomasz Korbak
  • Jun Shern Chan
  • Samuel R. Bowman
  • Kyunghyun Cho
  • Ethan Perez

The potential for pre-trained large language models (LLMs) to use natural language feedback at inference time has been an exciting recent development. We build upon this observation by formalizing an algorithm for learning from natural language feedback at training time instead, which we call Imitation learning from Language Feedback (ILF). ILF requires only a small amount of human-written feedback during training and does not require the same feedback at test time, making it both user-friendly and sample-efficient. We further show that ILF can be seen as a form of minimizing the KL divergence to the target distribution and demonstrate proof-of-concepts on text summarization and program synthesis tasks. For code generation, ILF improves a Codegen-Mono 6.1B model's pass@1 rate by 38% relative (and 10% absolute) on the Mostly Basic Python Problems (MBPP) benchmark, outperforming both fine-tuning on MBPP and fine-tuning on repaired programs written by humans. For summarization, we show that ILF can be combined with learning from human preferences to improve a GPT-3 model's summarization performance to be comparable to human quality, outperforming fine-tuning on human-written summaries. Overall, our results suggest that learning from human-written natural language feedback is both more effective and sample-efficient than training exclusively on demonstrations for improving an LLM's performance on a variety of tasks.

NeurIPS Conference 2024 Conference Paper

Many-shot Jailbreaking

  • Cem Anil
  • Esin Durmus
  • Nina Panickssery
  • Mrinank Sharma
  • Joe Benton
  • Sandipan Kundu
  • Joshua Batson
  • Meg Tong

We investigate a family of simple long-context attacks on large language models: prompting with hundreds of demonstrations of undesirable behavior. This attack is newly feasible with the larger context windows recently deployed by language model providers like Google DeepMind, OpenAI and Anthropic. We find that in diverse, realistic circumstances, the effectiveness of this attack follows a power law, up to hundreds of shots. We demonstrate the success of this attack on the most widely used state-of-the-art closed-weight models, and across various tasks. Our results suggest very long contexts present a rich new attack surface for LLMs.

ICLR Conference 2024 Conference Paper

Towards Understanding Sycophancy in Language Models

  • Mrinank Sharma
  • Meg Tong
  • Tomasz Korbak
  • David Duvenaud
  • Amanda Askell
  • Samuel R. Bowman
  • Esin Durmus
  • Zac Hatfield-Dodds

Reinforcement learning from human feedback (RLHF) is a popular technique for training high-quality AI assistants. However, RLHF may also encourage model responses that match user beliefs over truthful responses, a behavior known as sycophancy. We investigate the prevalence of sycophancy in RLHF-trained models and whether human preference judgments are responsible. We first demonstrate that five state-of-the-art AI assistants consistently exhibit sycophancy behavior across four varied free-form text-generation tasks. To understand if human preferences drive this broadly observed behavior of RLHF models, we analyze existing human preference data. We find that when a response matches a user's views, it is more likely to be preferred. Moreover, both humans and preference models (PMs) prefer convincingly-written sycophantic responses over correct ones a non-negligible fraction of the time. Optimizing model outputs against PMs also sometimes sacrifices truthfulness in favor of sycophancy. Overall, our results indicate that sycophancy is a general behavior of RLHF models, likely driven in part by human preference judgments favoring sycophantic responses.

ICLR Conference 2024 Conference Paper

Vision-Language Models are Zero-Shot Reward Models for Reinforcement Learning

  • Juan Rocamonde
  • Victoriano Montesinos
  • Elvis Nava
  • Ethan Perez
  • David Lindner

Reinforcement learning (RL) requires either manually specifying a reward function, which is often infeasible, or learning a reward model from a large amount of human feedback, which is often very expensive. We study a more sample-efficient alternative: using pretrained vision-language models (VLMs) as zero-shot reward models (RMs) to specify tasks via natural language. We propose a natural and general approach to using VLMs as reward models, which we call VLM-RMs. We use VLM-RMs based on CLIP to train a MuJoCo humanoid to learn complex tasks without a manually specified reward function, such as kneeling, doing the splits, and sitting in a lotus position. For each of these tasks, we only provide a single sentence text prompt describing the desired task with minimal prompt engineering. We provide videos of the trained agents at: https://sites.google.com/view/vlm-rm. We can improve performance by providing a second "baseline" prompt and projecting out parts of the CLIP embedding space irrelevant to distinguish between goal and baseline. Further, we find a strong scaling effect for VLM-RMs: larger VLMs trained with more compute and data are better reward models. The failure modes of VLM-RMs we encountered are all related to known capability limitations of current VLMs, such as limited spatial reasoning ability or visually unrealistic environments that are far off-distribution for the VLM. We find that VLM-RMs are remarkably robust as long as the VLM is large enough. This suggests that future VLMs will become more and more useful reward models for a wide range of RL applications.
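
A rough sketch of the reward idea in this abstract: score an observation by how well its embedding matches the embedding of the task prompt, optionally projecting onto the direction that separates a baseline prompt from the goal prompt. The vectors below are plain Python lists standing in for CLIP image and text embeddings, and the exact form of `regularized_reward` (including the `alpha` mixing weight) is an assumption for illustration; the paper's formula may differ in detail.

```python
from math import sqrt
from typing import List

Vector = List[float]


def normalize(v: Vector) -> Vector:
    """Scale a vector to unit length."""
    norm = sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_reward(state_emb: Vector, goal_emb: Vector) -> float:
    """Zero-shot reward: cosine similarity between the observation
    embedding and the goal-prompt embedding."""
    s, g = normalize(state_emb), normalize(goal_emb)
    return sum(a * b for a, b in zip(s, g))


def regularized_reward(state_emb: Vector, goal_emb: Vector,
                       baseline_emb: Vector, alpha: float) -> float:
    """Mix the state embedding with its projection onto the line through
    the baseline and goal embeddings, then score distance to the goal.
    alpha = 0 reduces to cosine_reward; alpha = 1 uses only the projection.
    Assumes the goal and baseline embeddings are not identical."""
    s, g, b = normalize(state_emb), normalize(goal_emb), normalize(baseline_emb)
    direction = [gi - bi for gi, bi in zip(g, b)]
    denom = sum(d * d for d in direction)
    t = sum((si - bi) * d for si, bi, d in zip(s, b, direction)) / denom
    projected = [bi + t * d for bi, d in zip(b, direction)]
    mixed = [alpha * p + (1.0 - alpha) * si for p, si in zip(projected, s)]
    return 1.0 - 0.5 * sum((m - gi) ** 2 for m, gi in zip(mixed, g))


if __name__ == "__main__":
    state, goal, baseline = [1.0, 1.0], [1.0, 0.0], [0.0, 1.0]
    print(cosine_reward(state, goal))
    print(regularized_reward(state, goal, baseline, alpha=1.0))
```

For unit vectors, 1 - 0.5 * ||s - g||^2 equals the cosine similarity, which is why the `alpha = 0` case matches `cosine_reward`.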

TMLR Journal 2023 Journal Article

Inverse Scaling: When Bigger Isn't Better

  • Ian R. McKenzie
  • Alexander Lyzhov
  • Michael Martin Pieler
  • Alicia Parrish
  • Aaron Mueller
  • Ameya Prabhu
  • Euan McLean
  • Xudong Shen

Work on scaling laws has found that large language models (LMs) show predictable improvements to overall loss with increased scale (model size, training data, and compute). Here, we present evidence for the claim that LMs may show inverse scaling, or worse task performance with increased scale, e.g., due to flaws in the training objective and data. We present empirical evidence of inverse scaling on 11 datasets collected by running a public contest, the Inverse Scaling Prize, with a substantial prize pool. Through analysis of the datasets, along with other examples found in the literature, we identify four potential causes of inverse scaling: (i) preference to repeat memorized sequences over following in-context instructions, (ii) imitation of undesirable patterns in the training data, (iii) tasks containing an easy distractor task which LMs could focus on, rather than the harder real task, and (iv) correct but misleading few-shot demonstrations of the task. We release the winning datasets at inversescaling.com/data to allow for further investigation of inverse scaling. Our tasks have helped drive the discovery of U-shaped and inverted-U scaling trends, where an initial trend reverses, suggesting that scaling trends are less reliable at predicting the behavior of larger-scale models than previously understood. Overall, our results suggest that there are tasks for which increased model scale alone may not lead to progress, and that more careful thought needs to go into the data and objectives for training language models.

NeurIPS Conference 2023 Conference Paper

Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting

  • Miles Turpin
  • Julian Michael
  • Ethan Perez
  • Samuel Bowman

Large Language Models (LLMs) can achieve strong performance on many tasks by producing step-by-step reasoning before giving a final output, often referred to as chain-of-thought reasoning (CoT). It is tempting to interpret these CoT explanations as the LLM's process for solving a task. This level of transparency into LLMs' predictions would yield significant safety benefits. However, we find that CoT explanations can systematically misrepresent the true reason for a model's prediction. We demonstrate that CoT explanations can be heavily influenced by adding biasing features to model inputs (e.g., by reordering the multiple-choice options in a few-shot prompt to make the answer always "(A)"), which models systematically fail to mention in their explanations. When we bias models toward incorrect answers, they frequently generate CoT explanations rationalizing those answers. This causes accuracy to drop by as much as 36% on a suite of 13 tasks from BIG-Bench Hard, when testing with GPT-3.5 from OpenAI and Claude 1.0 from Anthropic. On a social-bias task, model explanations justify giving answers in line with stereotypes without mentioning the influence of these social biases. Our findings indicate that CoT explanations can be plausible yet misleading, which risks increasing our trust in LLMs without guaranteeing their safety. Building more transparent and explainable systems will require either improving CoT faithfulness through targeted efforts or abandoning CoT in favor of alternative methods.

ICML Conference 2023 Conference Paper

Pretraining Language Models with Human Preferences

  • Tomasz Korbak
  • Kejian Shi
  • Angelica Chen
  • Rasika Vinayak Bhalerao
  • Christopher L. Buckley
  • Jason Phang
  • Samuel R. Bowman
  • Ethan Perez

Language models (LMs) are pretrained to imitate text from large and diverse datasets that contain content that would violate human preferences if generated by an LM: falsehoods, offensive comments, personally identifiable information, low-quality or buggy code, among others. Here, we explore alternative objectives for pretraining LMs in a way that also guides them to generate text aligned with human preferences. We benchmark five objectives for pretraining with human feedback across three tasks and study how they affect the alignment and capabilities of pretrained LMs. We find a Pareto-optimal and simple approach among those we explored: conditional training, or learning a distribution over tokens conditional on their human preference scores. Conditional training reduces the rate of undesirable content by up to an order of magnitude, both when generating without a prompt and with an adversarially-chosen prompt. Moreover, conditional training maintains the downstream task performance of standard LM pretraining, both before and after task-specific finetuning. Pretraining with human feedback results in much better preference satisfaction than standard LM pretraining followed by finetuning with feedback, i.e., learning and then unlearning undesirable behavior. Our results suggest that we should move beyond imitation learning when pretraining LMs and incorporate human preferences from the start of training.
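
One common way to realize conditional training is to prefix each training segment with a control token chosen from its preference score, so the model learns the token distribution conditional on that tag. The sketch below illustrates that data-preparation step only; the token strings, the threshold, and the function name are assumptions for this example rather than details stated in the abstract.

```python
from typing import List

# Hypothetical control tokens; a real setup would add these to the tokenizer.
GOOD_TOKEN = "<|good|>"
BAD_TOKEN = "<|bad|>"


def tag_segments(segments: List[str], scores: List[float],
                 threshold: float) -> List[str]:
    """Prefix each text segment with a control token based on whether its
    preference score reaches the threshold."""
    if len(segments) != len(scores):
        raise ValueError("segments and scores must have the same length")
    return [
        (GOOD_TOKEN if score >= threshold else BAD_TOKEN) + segment
        for segment, score in zip(segments, scores)
    ]


if __name__ == "__main__":
    tagged = tag_segments(
        ["The function returns a sorted list.", "lol whatever"],
        scores=[0.9, 0.1],
        threshold=0.5,
    )
    for line in tagged:
        print(line)
```

At generation time, the prompt would start with the good token so that sampling is conditioned on the preferred tag.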

ICML Conference 2021 Conference Paper

Rissanen Data Analysis: Examining Dataset Characteristics via Description Length

  • Ethan Perez
  • Douwe Kiela
  • Kyunghyun Cho

We introduce a method to determine if a certain capability helps to achieve an accurate model of given data. We view labels as being generated from the inputs by a program composed of subroutines with different capabilities, and we posit that a subroutine is useful if and only if the minimal program that invokes it is shorter than the one that does not. Since minimum program length is uncomputable, we instead estimate the labels’ minimum description length (MDL) as a proxy, giving us a theoretically-grounded method for analyzing dataset characteristics. We call the method Rissanen Data Analysis (RDA) after the father of MDL, and we showcase its applicability on a wide variety of settings in NLP, ranging from evaluating the utility of generating subquestions before answering a question, to analyzing the value of rationales and explanations, to investigating the importance of different parts of speech, and uncovering dataset gender bias.
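
To make the description-length comparison concrete, here is a toy sketch of prequential (online) coding: each label is encoded with the probability assigned by a model fitted only on earlier examples, and the total code length in bits approximates the labels' description length. The count-based model with add-one smoothing is an assumption chosen to keep the example dependency-free; it is not the estimator used in the paper.

```python
from collections import defaultdict
from math import log2
from typing import Hashable, List, Sequence


def prequential_codelength(inputs: Sequence[Hashable], labels: Sequence[int],
                           num_classes: int) -> float:
    """Bits needed to send the labels in order, where each label is coded
    using add-one-smoothed counts of earlier labels seen with the same input."""
    counts = defaultdict(lambda: [0] * num_classes)
    bits = 0.0
    for x, y in zip(inputs, labels):
        seen = counts[x]
        prob = (seen[y] + 1) / (sum(seen) + num_classes)
        bits += -log2(prob)
        seen[y] += 1  # update the model only after paying for this label
    return bits


if __name__ == "__main__":
    labels: List[int] = [0, 1] * 8
    # "With capability": the input reveals a feature that predicts the label.
    with_feature = prequential_codelength(labels, labels, num_classes=2)
    # "Without capability": every example looks the same to the model.
    without_feature = prequential_codelength([None] * len(labels), labels, num_classes=2)
    print(with_feature, without_feature)
```

A shorter code length when the extra input is provided is the signal that the corresponding capability helps model the data.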

NeurIPS Conference 2021 Conference Paper

True Few-Shot Learning with Language Models

  • Ethan Perez
  • Douwe Kiela
  • Kyunghyun Cho

Pretrained language models (LMs) perform well on many tasks even when learning from a few examples, but prior work uses many held-out examples to tune various aspects of learning, such as hyperparameters, training objectives, and natural language templates ("prompts"). Here, we evaluate the few-shot ability of LMs when such held-out examples are unavailable, a setting we call true few-shot learning. We test two model selection criteria, cross-validation and minimum description length, for choosing LM prompts and hyperparameters in the true few-shot setting. On average, both marginally outperform random selection and greatly underperform selection based on held-out examples. Moreover, selection criteria often prefer models that perform significantly worse than randomly-selected ones. We find similar results even when taking into account our uncertainty in a model's true performance during selection, as well as when varying the amount of computation and number of examples used for selection. Overall, our findings suggest that prior work significantly overestimated the true few-shot ability of LMs given the difficulty of few-shot model selection.
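
The cross-validation criterion mentioned above can be sketched as follows: prompts are ranked using only the few available examples, with no separate held-out set. The `evaluate` callback is a hypothetical stand-in for whatever score the language model assigns to a held-out example given a prompt and the remaining examples; the fold construction is likewise just one reasonable choice.

```python
from typing import Callable, List, Sequence, Tuple

Example = Tuple[str, str]  # (input text, label text)
Evaluate = Callable[[str, List[Example], Example], float]


def cross_validation_score(prompt: str, examples: Sequence[Example],
                           evaluate: Evaluate, k: int) -> float:
    """Average held-out score of `prompt` over k folds of the few-shot set."""
    folds = [list(examples[i::k]) for i in range(k)]
    total = 0.0
    for i, held_out in enumerate(folds):
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        total += sum(evaluate(prompt, train, ex) for ex in held_out)
    return total / len(examples)


def select_prompt(prompts: Sequence[str], examples: Sequence[Example],
                  evaluate: Evaluate, k: int = 4) -> str:
    """Pick the prompt with the best cross-validation score."""
    return max(prompts, key=lambda p: cross_validation_score(p, examples, evaluate, k))


if __name__ == "__main__":
    few_shot = [("2+2", "4"), ("3+3", "6"), ("1+5", "6"), ("4+4", "8")]

    def toy_evaluate(prompt: str, train: List[Example], ex: Example) -> float:
        # Placeholder scorer: a real one would query a language model.
        return 1.0 if "Answer" in prompt else 0.0

    print(select_prompt(["Q: {x} Answer:", "{x} ->"], few_shot, toy_evaluate, k=2))
```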

NeurIPS Conference 2020 Conference Paper

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

  • Patrick Lewis
  • Ethan Perez
  • Aleksandra Piktus
  • Fabio Petroni
  • Vladimir Karpukhin
  • Naman Goyal
  • Heinrich Küttler
  • Mike Lewis

Large pre-trained language models have been shown to store factual knowledge in their parameters, and achieve state-of-the-art results when fine-tuned on downstream NLP tasks. However, their ability to access and precisely manipulate knowledge is still limited, and hence on knowledge-intensive tasks, their performance lags behind task-specific architectures. Additionally, providing provenance for their decisions and updating their world knowledge remain open research problems. Pre-trained models with a differentiable access mechanism to explicit non-parametric memory can overcome this issue, but have so far been only investigated for extractive downstream tasks. We explore a general-purpose fine-tuning recipe for retrieval-augmented generation (RAG) -- models which combine pre-trained parametric and non-parametric memory for language generation. We introduce RAG models where the parametric memory is a pre-trained seq2seq model and the non-parametric memory is a dense vector index of Wikipedia, accessed with a pre-trained neural retriever. We compare two RAG formulations, one which conditions on the same retrieved passages across the whole generated sequence, the other can use different passages per token. We fine-tune and evaluate our models on a wide range of knowledge-intensive NLP tasks and set the state-of-the-art on three open domain QA tasks, outperforming parametric seq2seq models and task-specific retrieve-and-extract architectures. For language generation tasks, we find that RAG models generate more specific, diverse and factual language than a state-of-the-art parametric-only seq2seq baseline.
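
The difference between the two formulations can be illustrated with a toy calculation. One marginalizes over retrieved passages once per sequence; the other marginalizes at every token. The probabilities below are made-up inputs and the function names are chosen here for the example; a real system would obtain them from a retriever and a seq2seq generator.

```python
from math import prod
from typing import List


def rag_sequence(doc_probs: List[float], token_probs: List[List[float]]) -> float:
    """Same passage for the whole output:
    sum over passages z of p(z|x) * product over tokens of p(y_i | x, z, y_<i)."""
    return sum(p_z * prod(tokens) for p_z, tokens in zip(doc_probs, token_probs))


def rag_token(doc_probs: List[float], token_probs: List[List[float]]) -> float:
    """A different passage may be used for each token:
    product over tokens of sum over passages z of p(z|x) * p(y_i | x, z, y_<i)."""
    length = len(token_probs[0])
    return prod(
        sum(p_z * tokens[i] for p_z, tokens in zip(doc_probs, token_probs))
        for i in range(length)
    )


if __name__ == "__main__":
    # Two retrieved passages; each supports only one of the two output tokens.
    doc_probs = [0.5, 0.5]
    token_probs = [[1.0, 0.0], [0.0, 1.0]]
    print(rag_sequence(doc_probs, token_probs))
    print(rag_token(doc_probs, token_probs))
```

In this toy case no single passage supports both tokens, so the per-sequence form assigns the output zero probability while the per-token form does not.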

AAAI Conference 2018 Conference Paper

FiLM: Visual Reasoning with a General Conditioning Layer

  • Ethan Perez
  • Florian Strub
  • Harm de Vries
  • Vincent Dumoulin
  • Aaron Courville

We introduce a general-purpose conditioning method for neural networks called FiLM: Feature-wise Linear Modulation. FiLM layers influence neural network computation via a simple, feature-wise affine transformation based on conditioning information. We show that FiLM layers are highly effective for visual reasoning (answering image-related questions which require a multi-step, high-level process), a task which has proven difficult for standard deep learning methods that do not explicitly model reasoning. Specifically, we show on visual reasoning tasks that FiLM layers 1) halve state-of-the-art error for the CLEVR benchmark, 2) modulate features in a coherent manner, 3) are robust to ablations and architectural modifications, and 4) generalize well to challenging, new data from few examples or even zero-shot.
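
The feature-wise affine transformation can be sketched in a few lines. In this minimal example each feature channel is scaled by a gamma value and shifted by a beta value; here those values are passed in directly, whereas in a FiLM layer they would be predicted from the conditioning input (for example, a question encoding). The list-based representation and function name are assumptions made to keep the sketch dependency-free.

```python
from typing import List


def film(features: List[List[float]], gamma: List[float],
         beta: List[float]) -> List[List[float]]:
    """Apply out[n][c] = gamma[c] * features[n][c] + beta[c] to every row.

    features: one row per position or example, one column per feature channel.
    gamma, beta: one scale and one shift per feature channel."""
    if len(gamma) != len(beta):
        raise ValueError("gamma and beta must have one entry per feature")
    if any(len(row) != len(gamma) for row in features):
        raise ValueError("each feature row must match the length of gamma")
    return [[g * x + b for x, g, b in zip(row, gamma, beta)] for row in features]


if __name__ == "__main__":
    feats = [[1.0, 2.0], [3.0, 4.0]]
    # The first channel is doubled; the second is halved and shifted by one.
    print(film(feats, gamma=[2.0, 0.5], beta=[0.0, 1.0]))
```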