Arrow Research search

Author name cluster

Jacob Devlin

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

4 papers
2 author rows

Possible papers

4

JMLR Journal 2024 Journal Article

Scaling Instruction-Finetuned Language Models

  • Hyung Won Chung
  • Le Hou
  • Shayne Longpre
  • Barret Zoph
  • Yi Tay
  • William Fedus
  • Yunxuan Li
  • Xuezhi Wang

Finetuning language models on a collection of datasets phrased as instructions has been shown to improve model performance and generalization to unseen tasks. In this paper we explore instruction finetuning with a particular focus on (1) scaling the number of tasks, (2) scaling the model size, and (3) finetuning on chain-of-thought data. We find that instruction finetuning with the above aspects dramatically improves performance on a variety of model classes (PaLM, T5, U-PaLM), prompting setups (zero-shot, few-shot, CoT), and evaluation benchmarks (MMLU, BBH, TyDiQA, MGSM, open-ended generation, RealToxicityPrompts). For instance, Flan-PaLM 540B instruction-finetuned on 1.8K tasks outperforms PaLM 540B by a large margin (+9.4% on average). Flan-PaLM 540B achieves state-of-the-art performance on several benchmarks (at time of release), such as 75.2% on five-shot MMLU. We also publicly release Flan-T5 checkpoints,1 which achieve strong few-shot performance even compared to much larger models, such as PaLM 62B. Overall, instruction finetuning is a general method for improving the performance and usability of pretrained language models. [abs] [ pdf ][ bib ] &copy JMLR 2024. ( edit, beta )

JMLR Journal 2023 Journal Article

PaLM: Scaling Language Modeling with Pathways

  • Aakanksha Chowdhery
  • Sharan Narang
  • Jacob Devlin
  • Maarten Bosma
  • Gaurav Mishra
  • Adam Roberts
  • Paul Barham
  • Hyung Won Chung

Large language models have been shown to achieve remarkable performance across a variety of natural language tasks using few-shot learning, which drastically reduces the number of task-specific training examples needed to adapt the model to a particular application. To further our understanding of the impact of scale on few-shot learning, we trained a 540-billion parameter, densely activated, Transformer language model, which we call Pathways Language Model (PaLM). We trained PaLM on 6144 TPU v4 chips using Pathways, a new ML system which enables highly efficient training across multiple TPU Pods. We demonstrate continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks. On a number of these tasks, PaLM 540B achieves breakthrough performance, outperforming the finetuned state-of-the-art on a suite of multi-step reasoning tasks, and outperforming average human performance on the recently released BIG-bench benchmark. A significant number of BIG-bench tasks showed discontinuous improvements from model scale, meaning that performance steeply increased as we scaled to our largest model. PaLM also has strong capabilities in multilingual tasks and source code generation, which we demonstrate on a wide array of benchmarks. We additionally provide a comprehensive analysis on bias and toxicity, and study the extent of training data memorization with respect to model scale. Finally, we discuss the ethical considerations related to large language models and discuss potential mitigation strategies. [abs] [ pdf ][ bib ] &copy JMLR 2023. ( edit, beta )

NeurIPS Conference 2017 Conference Paper

Neural Program Meta-Induction

  • Jacob Devlin
  • Rudy Bunel
  • Rishabh Singh
  • Matthew Hausknecht
  • Pushmeet Kohli

Most recently proposed methods for Neural Program induction work under the assumption of having a large set of input/output (I/O) examples for learning any given input-output mapping. This paper aims to address the problem of data and computation efficiency of program induction by leveraging information from related tasks. Specifically, we propose two novel approaches for cross-task knowledge transfer to improve program induction in limited-data scenarios. In our first proposal, portfolio adaptation, a set of induction models is pretrained on a set of related tasks, and the best model is adapted towards the new task using transfer learning. In our second approach, meta program induction, a $k$-shot learning approach is used to make a model generalize to new tasks without additional training. To test the efficacy of our methods, we constructed a new benchmark of programs written in the Karel programming language. Using an extensive experimental evaluation on the Karel benchmark, we demonstrate that our proposals dramatically outperform the baseline induction method that does not use knowledge transfer. We also analyze the relative performance of the two approaches and study conditions in which they perform best. In particular, meta induction outperforms all existing approaches under extreme data sparsity (when a very small number of examples are available), i. e. , fewer than ten. As the number of available I/O examples increase (i. e. a thousand or more), portfolio adapted program induction becomes the best approach. For intermediate data sizes, we demonstrate that the combined method of adapted meta program induction has the strongest performance.

ICML Conference 2017 Conference Paper

RobustFill: Neural Program Learning under Noisy I/O

  • Jacob Devlin
  • Jonathan Uesato
  • Surya Bhupatiraju
  • Rishabh Singh
  • Abdel-rahman Mohamed
  • Pushmeet Kohli

The problem of automatically generating a computer program from some specification has been studied since the early days of AI. Recently, two competing approaches for `automatic program learning’ have received significant attention: (1) `neural program synthesis’, where a neural network is conditioned on input/output (I/O) examples and learns to generate a program, and (2) `neural program induction’, where a neural network generates new outputs directly using a latent program representation. Here, for the first time, we directly compare both approaches on a large-scale, real-world learning task and we additionally contrast to rule-based program synthesis, which uses hand-crafted semantics to guide the program generation. Our neural models use a modified attention RNN to allow encoding of variable-sized sets of I/O pairs, which achieve 92\% accuracy on a real-world test set, compared to the 34\% accuracy of the previous best neural synthesis approach. The synthesis model also outperforms a comparable induction model on this task, but we more importantly demonstrate that the strength of each approach is highly dependent on the evaluation metric and end-user application. Finally, we show that we can train our neural models to remain very robust to the type of noise expected in real-world data (e. g. , typos), while a highly-engineered rule-based system fails entirely.