Arrow Research search

Author name cluster

Noah Goodman

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

40 papers
1 author row

Possible papers

40

TMLR Journal 2026 Journal Article

Large Language Model Reasoning Failures

  • Peiyang Song
  • Pengrui Han
  • Noah Goodman

Large Language Models (LLMs) have exhibited remarkable reasoning capabilities, achieving impressive results across a wide range of tasks. Despite these advances, significant reasoning failures persist, occurring even in seemingly simple scenarios. To systematically understand and address these shortcomings, we present the first comprehensive survey dedicated to reasoning failures in LLMs. We introduce a novel categorization framework that distinguishes reasoning into embodied and non-embodied types, with the latter further subdivided into informal (intuitive) and formal (logical) reasoning. In parallel, we classify reasoning failures along a complementary axis into three types: fundamental failures intrinsic to LLM architectures that broadly affect downstream tasks; application-specific limitations that manifest in particular domains; and robustness issues characterized by inconsistent performance across minor variations. For each reasoning failure, we provide a clear definition, analyze existing studies, explore root causes, and present mitigation strategies. By unifying fragmented research efforts, our survey provides a structured perspective on systemic weaknesses in LLM reasoning, offering valuable insights and guiding future research towards building stronger, more reliable, and robust reasoning capabilities. We additionally release a comprehensive collection of research works on LLM reasoning failures, as a GitHub repository at https://github.com/Peiyang-Song/Awesome-LLM-Reasoning-Failures, to provide an easy entry point to this area.

JMLR Journal 2025 Journal Article

Causal Abstraction: A Theoretical Foundation for Mechanistic Interpretability

  • Atticus Geiger
  • Duligur Ibeling
  • Amir Zur
  • Maheep Chaudhary
  • Sonakshi Chauhan
  • Jing Huang
  • Aryaman Arora
  • Zhengxuan Wu

Causal abstraction provides a theoretical foundation for mechanistic interpretability, the field concerned with providing intelligible algorithms that are faithful simplifications of the known, but opaque low-level details of black box AI models. Our contributions are (1) generalizing the theory of causal abstraction from mechanism replacement (i.e., hard and soft interventions) to arbitrary mechanism transformation (i.e., functionals from old mechanisms to new mechanisms), (2) providing a flexible, yet precise formalization for the core concepts of polysemantic neurons, the linear representation hypothesis, modular features, and graded faithfulness, and (3) unifying a variety of mechanistic interpretability methods in the common language of causal abstraction, namely, activation and path patching, causal mediation analysis, causal scrubbing, causal tracing, circuit analysis, concept erasure, sparse autoencoders, differential binary masking, distributed alignment search, and steering. [abs] [ pdf ][ bib ] &copy JMLR 2025. ( edit, beta )

TMLR Journal 2025 Journal Article

Emergent Symbol-like Number Variables in Artificial Neural Networks

  • Satchel Grant
  • Noah Goodman
  • James Lloyd McClelland

What types of numeric representations emerge in neural systems, and what would a satisfying answer to this question look like? In this work, we interpret Neural Network (NN) solutions to sequence based number tasks through a variety of methods to understand how well we can interpret them through the lens of interpretable Symbolic Algorithms (SAs)—precise algorithms describable by rules operating on typed, mutable variables. We use GRUs, LSTMs, and Transformers trained using Next Token Prediction (NTP) on tasks where the correct tokens depend on numeric information only latent in the task structure. We show through multiple causal and theoretical methods that we can interpret raw NN activity through the lens of simplified SAs when we frame the neural activity in terms of neural subspaces rather than individual neurons. Using Distributed Alignment Search (DAS), we find that, depending on network architecture, dimensionality, and task specifications, alignments with SA’s can be very high, while other times they can be only approximate, or fail altogether. We extend our analytic toolkit to address the failure cases by expanding the DAS framework to a broader class of alignment functions that more flexibly capture NN activity in terms of interpretable variables from SAs, and we provide theoretic and empirical explorations of Linear Alignment Functions (LAFs) in contrast to the preexisting Orthogonal Alignment Functions (OAFs). Through analyses of specific cases we confirm the usefulness of causal interventions on neural subspaces for NN interpretability, and we show that recurrent models can develop graded, symbol-like number variables within their neural activity. We further show that shallow Transformers learn very different solutions than recurrent networks, and we prove that such models must use anti-Markovian solutions—solutions that do not rely on cumulative, Markovian hidden states—in the absence of sufficient attention layers.

NeurIPS Conference 2025 Conference Paper

In-Context Learning Strategies Emerge Rationally

  • Daniel Wurgaft
  • Ekdeep S Lubana
  • Core Francisco Park
  • Hidenori Tanaka
  • Gautam Reddy
  • Noah Goodman

Recent work analyzing in-context learning (ICL) has identified a broad set of strategies that describe model behavior in different experimental conditions. We aim to unify these findings by asking why a model learns these disparate strategies in the first place. Specifically, we start with the observation that when trained to learn a mixture of tasks, as is popular in the literature, the strategies learned by a model for performing ICL can be captured by a family of Bayesian predictors: a memorizing predictor, which assumes a discrete prior on the set of seen tasks, and a generalizing predictor, where the prior matches the underlying task distribution. Adopting the normative lens of rational analysis, where a learner’s behavior is explained as an optimal adaptation to data given computational constraints, we develop a hierarchical Bayesian framework that almost perfectly predicts Transformer next- token predictions throughout training—without assuming access to its weights. Under this framework, pretraining is viewed as a process of updating the posterior probability of different strategies, and inference-time behavior as a posterior- weighted average over these strategies’ predictions. Our framework draws on common assumptions about neural network learning dynamics, which make explicit a tradeoff between loss and complexity among candidate strategies: beyond how well it explains the data, a model’s preference towards implementing a strategy is dictated by its complexity. This helps explain well-known ICL phenomena, while offering novel predictions: e. g. , we show a superlinear trend in the timescale for transitioning from generalization to memorization as task diversity increases. Overall, our work advances an explanatory and predictive account of ICL grounded in tradeoffs between strategy loss and complexity.

TMLR Journal 2024 Journal Article

Certified Deductive Reasoning with Language Models

  • Gabriel Poesia
  • Kanishk Gandhi
  • Eric Zelikman
  • Noah Goodman

Language models often achieve higher accuracy when reasoning step-by-step in complex tasks. However, even when arriving at a correct final answer, their rationales are often logically unsound or inconsistent. This is a major issue when reliable reasoning traces are needed, such as when fine-tuning on model-generated reasoning for self-improvement. To tackle these issues, we introduce a class of tools for language models called guides, that use state and incremental constraints to guide generation. A guide can be invoked by the model to constrain its own generation to a set of valid statements given by the tool. In turn, the model’s choices can change the guide’s state. We show how a general system for logical reasoning can be used as a guide, which we call LogicGuide. Given a reasoning problem in natural language, a model can formalize its assumptions for LogicGuide and guarantee that its step-by-step reasoning is sound. In experiments on PrOntoQA, ProofWriter and Syllogism Validity datasets, LogicGuide significantly improves the performance of GPT-3, GPT-3.5 Turbo and LLaMA (accuracy gains up to 35%), while drastically reducing content effects — the interference between unwanted prior assumptions and reasoning, which humans and language models suffer from. We then explore bootstrapping GPT-3.5 Turbo and LLaMA using their own reasoning traces. We find that LogicGuide is critical: by training only on certified self-generated reasoning, models can self-improve, avoiding learning from their own hallucinations. Moreover, bootstrapped models enjoy significant boosts on ReClor, a challenging real-world reasoning dataset, even when not relying on formalization at inference time.

NeurIPS Conference 2023 Conference Paper

Feature Dropout: Revisiting the Role of Augmentations in Contrastive Learning

  • Alex Tamkin
  • Margalit Glasgow
  • Xiluo He
  • Noah Goodman

What role do augmentations play in contrastive learning? Recent work suggests that good augmentations are label-preserving with respect to a specific downstream task. We complicate this picture by showing that label-destroying augmentations can be useful in the foundation model setting, where the goal is to learn diverse, general-purpose representations for multiple downstream tasks. We perform contrastive learning experiments on a range of image and audio datasets with multiple downstream tasks (e. g. for digits superimposed on photographs, predicting the class of one vs. the other). We find that Viewmaker Networks, a recently proposed model for learning augmentations for contrastive learning, produce label-destroying augmentations that stochastically destroy features needed for different downstream tasks. These augmentations are interpretable (e. g. altering shapes, digits, or letters added to images) and surprisingly often result in better performance compared to expert-designed augmentations, despite not preserving label information. To support our empirical results, we theoretically analyze a simple contrastive learning setting with a linear model. In this setting, label-destroying augmentations are crucial for preventing one set of features from suppressing the learning of features useful for another downstream task. Our results highlight the need for analyzing the interaction between multiple downstream tasks when trying to explain the success of foundation models.

NeurIPS Conference 2023 Conference Paper

Interpretability at Scale: Identifying Causal Mechanisms in Alpaca

  • Zhengxuan Wu
  • Atticus Geiger
  • Thomas Icard
  • Christopher Potts
  • Noah Goodman

Obtaining human-interpretable explanations of large, general-purpose language models is an urgent goal for AI safety. However, it is just as important that our interpretability methods are faithful to the causal dynamics underlying model behavior and able to robustly generalize to unseen inputs. Distributed Alignment Search (DAS) is a powerful gradient descent method grounded in a theory of causal abstraction that uncovered perfect alignments between interpretable symbolic algorithms and small deep learning models fine-tuned for specific tasks. In the present paper, we scale DAS significantly by replacing the remaining brute-force search steps with learned parameters -- an approach we call Boundless DAS. This enables us to efficiently search for interpretable causal structure in large language models while they follow instructions. We apply Boundless DAS to the Alpaca model (7B parameters), which, off the shelf, solves a simple numerical reasoning problem. With Boundless DAS, we discover that Alpaca does this by implementing a causal model with two interpretable boolean variables. Furthermore, we find that the alignment of neural representations with these variables is robust to changes in inputs and instructions. These findings mark a first step toward deeply understanding the inner-workings of our largest and most widely deployed language models.

NeurIPS Conference 2023 Conference Paper

Learning to Compress Prompts with Gist Tokens

  • Jesse Mu
  • Xiang Li
  • Noah Goodman

Prompting is the primary way to utilize the multitask capabilities of language models (LMs), but prompts occupy valuable space in the input context window, and repeatedly encoding the same prompt is computationally inefficient. Finetuning and distillation methods allow for specialization of LMs without prompting, but require retraining the model for each task. To avoid this trade-off entirely, we present gisting, which trains an LM to compress prompts into smaller sets of "gist" tokens which can be cached and reused for compute efficiency. Gist models can be trained with no additional cost over standard instruction finetuning by simply modifying Transformer attention masks to encourage prompt compression. On decoder (LLaMA-7B) and encoder-decoder (FLAN-T5-XXL) LMs, gisting enables up to 26x compression of prompts, resulting in up to 40% FLOPs reductions, 4. 2% wall time speedups, and storage savings, all with minimal loss in output quality.

NeurIPS Conference 2023 Conference Paper

Parsel🐍: Algorithmic Reasoning with Language Models by Composing Decompositions

  • Eric Zelikman
  • Qian Huang
  • Gabriel Poesia
  • Noah Goodman
  • Nick Haber

Despite recent success in large language model (LLM) reasoning, LLMs struggle with hierarchical multi-step reasoning tasks like generating complex programs. For these tasks, humans often start with a high-level algorithmic design and implement each part gradually. We introduce Parsel, a framework enabling automatic implementation and validation of complex algorithms with code LLMs. With Parsel, we automatically decompose algorithmic tasks into hierarchical natural language function descriptions and then search over combinations of possible function implementations using tests. We show that Parsel can be used across domains requiring hierarchical reasoning, including program synthesis and robotic planning. We find that, using Parsel, LLMs solve more competition-level problems in the APPS dataset, resulting in pass rates over 75\% higher than prior results from directly sampling AlphaCode and Codex, while often using a smaller sample budget. Moreover, with automatically generated tests, we find that Parsel can improve the state-of-the-art pass@1 performance on HumanEval from 67\% to 85\%. We also find that LLM-generated robotic plans using Parsel are more than twice as likely to be considered accurate than directly generated plans. Lastly, we explore how Parsel addresses LLM limitations and discuss how Parsel may be useful for human programmers. We release our code at https: //github. com/ezelikman/parsel.

NeurIPS Conference 2023 Conference Paper

Understanding Social Reasoning in Language Models with Language Models

  • Kanishk Gandhi
  • Jan-Philipp Fraenken
  • Tobias Gerstenberg
  • Noah Goodman

As Large Language Models (LLMs) become increasingly integrated into our everyday lives, understanding their ability to comprehend human mental states becomes critical for ensuring effective interactions. However, despite the recent attempts to assess the Theory-of-Mind (ToM) reasoning capabilities of LLMs, the degree to which these models can align with human ToM remains a nuanced topic of exploration. This is primarily due to two distinct challenges: (1) the presence of inconsistent results from previous evaluations, and (2) concerns surrounding the validity of existing evaluation methodologies. To address these challenges, we present a novel framework for procedurally generating evaluations with LLMs by populating causal templates. Using our framework, we create a new social reasoning benchmark (BigToM) for LLMs which consists of 25 controls and 5, 000 model-written evaluations. We find that human participants rate the quality of our benchmark higher than previous crowd-sourced evaluations and comparable to expert-written evaluations. Using BigToM, we evaluate the social reasoning capabilities of a variety of LLMs and compare model performances with human performance. Our results suggest that GPT4 has ToM capabilities that mirror human inference patterns, though less reliable, while other LLMs struggle.

NeurIPS Conference 2023 Conference Paper

Why think step by step? Reasoning emerges from the locality of experience

  • Ben Prystawski
  • Michael Li
  • Noah Goodman

Humans have a powerful and mysterious capacity to reason. Working through a set of mental steps enables us to make inferences we would not be capable of making directly even though we get no additional data from the world. Similarly, when large language models generate intermediate steps (a chain of thought) before answering a question, they often produce better answers than they would directly. We investigate why and how chain-of-thought reasoning is useful in language models, testing the hypothesis that reasoning is effective when training data consists of overlapping local clusters of variables that influence each other strongly. These training conditions enable the chaining of accurate local inferences to estimate relationships between variables that were not seen together in training. We prove that there will exist a "reasoning gap", where reasoning through intermediate variables reduces bias, for the simple case of an autoregressive density estimator trained on local samples from a chain-structured probabilistic model. We then test our hypothesis experimentally in more complex models, training an autoregressive language model on samples from Bayes nets but only including a subset of variables in each sample. We test language models’ ability to match conditional probabilities with and without intermediate reasoning steps, finding that intermediate steps are only helpful when the training data is locally structured with respect to dependencies between variables. The combination of locally structured observations and reasoning is much more data-efficient than training on all variables. Our results illustrate how the effectiveness of reasoning step by step is rooted in the local statistical structure of the training data.

NeurIPS Conference 2022 Conference Paper

Active Learning Helps Pretrained Models Learn the Intended Task

  • Alex Tamkin
  • Dat Nguyen
  • Salil Deshpande
  • Jesse Mu
  • Noah Goodman

Models can fail in unpredictable ways during deployment due to task ambiguity, when multiple behaviors are consistent with the provided training data. An example is an object classifier trained on red squares and blue circles: when encountering blue squares, the intended behavior is undefined. We investigate whether pretrained models are better active learners, capable of disambiguating between the possible tasks a user may be trying to specify. Intriguingly, we find that better active learning is an emergent property of the pretraining process: pretrained models require up to 5 times fewer labels when using uncertainty-based active learning, while non-pretrained models see no or even negative benefit. We find these gains come from an ability to select examples with attributes that disambiguate the intended behavior, such as rare product categories or atypical backgrounds. These attributes are far more linearly separable in pretrained model's representation spaces vs non-pretrained models, suggesting a possible mechanism for this behavior.

NeurIPS Conference 2022 Conference Paper

Assistive Teaching of Motor Control Tasks to Humans

  • Megha Srivastava
  • Erdem Biyik
  • Suvir Mirchandani
  • Noah Goodman
  • Dorsa Sadigh

Recent works on shared autonomy and assistive-AI technologies, such as assistive robotic teleoperation, seek to model and help human users with limited ability in a fixed task. However, these approaches often fail to account for humans' ability to adapt and eventually learn how to execute a control task themselves. Furthermore, in applications where it may be desirable for a human to intervene, these methods may have inhibited their ability to learn how to succeed with full self-control. In this paper, we focus on the problem of assistive teaching of motor control tasks such as parking a car or landing an aircraft. Despite their ubiquitous role in humans' daily activities and occupations, motor tasks are rarely taught in a uniform way due to their high complexity and variance. We propose an AI-assisted teaching algorithm that leverages skill discovery methods from reinforcement learning (RL) literature to (i) break down any motor control task into teachable skills, (ii) construct novel drill sequences, and (iii) individualize curricula to students with different capabilities. Through an extensive mix of synthetic and user studies on two motor control tasks - parking a car with a joystick and writing characters from the Balinese alphabet - we show that assisted teaching with skills improve student performance by around 40% compared to practicing full trajectories without skills, and practicing with individualized drills can result in up to 25% further improvement.

NeurIPS Conference 2022 Conference Paper

CLEVRER-Humans: Describing Physical and Causal Events the Human Way

  • Jiayuan Mao
  • Xuelin Yang
  • Xikun Zhang
  • Noah Goodman
  • Jiajun Wu

Building machines that can reason about physical events and their causal relationships is crucial for flexible interaction with the physical world. However, most existing physical and causal reasoning benchmarks are exclusively based on synthetically generated events and synthetic natural language descriptions of the causal relationships. This design brings up two issues. First, there is a lack of diversity in both event types and natural language descriptions; second, causal relationships based on manually-defined heuristics are different from human judgments. To address both shortcomings, we present the CLEVRER-Humans benchmark, a video reasoning dataset for causal judgment of physical events with human labels. We employ two techniques to improve data collection efficiency: first, a novel iterative event cloze task to elicit a new representation of events in videos, which we term Causal Event Graphs (CEGs); second, a data augmentation technique based on neural language generative models. We convert the collected CEGs into questions and answers to be consistent with prior work. Finally, we study a collection of baseline approaches for CLEVRER-Humans question-answering, highlighting great challenges set forth by our benchmark.

NeurIPS Conference 2022 Conference Paper

DABS 2.0: Improved Datasets and Algorithms for Universal Self-Supervision

  • Alex Tamkin
  • Gaurab Banerjee
  • Mohamed Owda
  • Vincent Liu
  • Shashank Rammoorthy
  • Noah Goodman

Universal self-supervised (SSL) algorithms hold enormous promise for making machine learning accessible to high-impact domains such as protein biology, manufacturing, and genomics. We present DABS 2. 0: a set of improved datasets and algorithms for advancing research on universal SSL. We extend the recently-introduced DABS benchmark with the addition of five real-world science and engineering domains: protein biology, bacterial genomics, multispectral satellite imagery, semiconductor wafers, and particle physics, bringing the total number of domains in the benchmark to twelve. We also propose a new universal SSL algorithm, Capri, and a generalized version of masked autoencoding, and apply both on all twelve domains---the most wide-ranging exploration of SSL yet. We find that multiple algorithms show gains across domains, outperforming previous baselines. In addition, we demonstrate the usefulness of DABS for scientific study of SSL by investigating the optimal corruption rate for each algorithm, showing that the best setting varies based on the domain. Code will be released at http: //github. com/alextamkin/dabs}{http: //github. com/alextamkin/dabs

NeurIPS Conference 2022 Conference Paper

Foundation Posteriors for Approximate Probabilistic Inference

  • Mike Wu
  • Noah Goodman

Probabilistic programs provide an expressive representation language for generative models. Given a probabilistic program, we are interested in the task of posterior inference: estimating a latent variable given a set of observed variables. Existing techniques for inference in probabilistic programs often require choosing many hyper-parameters, are computationally expensive, and/or only work for restricted classes of programs. Here we formulate inference as masked language modeling: given a program, we generate a supervised dataset of variables and assignments, and randomly mask a subset of the assignments. We then train a neural network to unmask the random values, defining an approximate posterior distribution. By optimizing a single neural network across a range of programs we amortize the cost of training, yielding a "foundation" posterior able to do zero-shot inference for new programs. The foundation posterior can also be fine-tuned for a particular program and dataset by optimizing a variational inference objective. We show the efficacy of the approach, zero-shot and fine-tuned, on a benchmark of STAN programs.

NeurIPS Conference 2022 Conference Paper

Geoclidean: Few-Shot Generalization in Euclidean Geometry

  • Joy Hsu
  • Jiajun Wu
  • Noah Goodman

Euclidean geometry is among the earliest forms of mathematical thinking. While the geometric primitives underlying its constructions, such as perfect lines and circles, do not often occur in the natural world, humans rarely struggle to perceive and reason with them. Will computer vision models trained on natural images show the same sensitivity to Euclidean geometry? Here we explore these questions by studying few-shot generalization in the universe of Euclidean geometry constructions. We introduce Geoclidean, a domain-specific language for Euclidean geometry, and use it to generate two datasets of geometric concept learning tasks for benchmarking generalization judgements of humans and machines. We find that humans are indeed sensitive to Euclidean geometry and generalize strongly from a few visual examples of a geometric concept. In contrast, low-level and high-level visual features from standard computer vision models pretrained on natural images do not support correct generalization. Thus Geoclidean represents a novel few-shot generalization benchmark for geometric concept learning, where the performance of humans and of AI models diverge. The Geoclidean framework and dataset are publicly available for download.

NeurIPS Conference 2022 Conference Paper

Improving Intrinsic Exploration with Language Abstractions

  • Jesse Mu
  • Victor Zhong
  • Roberta Raileanu
  • Minqi Jiang
  • Noah Goodman
  • Tim Rocktäschel
  • Edward Grefenstette

Reinforcement learning (RL) agents are particularly hard to train when rewards are sparse. One common solution is to use intrinsic rewards to encourage agents to explore their environment. However, recent intrinsic exploration methods often use state-based novelty measures which reward low-level exploration and may not scale to domains requiring more abstract skills. Instead, we explore natural language as a general medium for highlighting relevant abstractions in an environment. Unlike previous work, we evaluate whether language can improve over existing exploration methods by directly extending (and comparing to) competitive intrinsic exploration baselines: AMIGo (Campero et al. , 2021) and NovelD (Zhang et al. , 2021). These language-based variants outperform their non-linguistic forms by 47-85% across 13 challenging tasks from the MiniGrid and MiniHack environment suites.

NeurIPS Conference 2022 Conference Paper

STaR: Bootstrapping Reasoning With Reasoning

  • Eric Zelikman
  • Yuhuai Wu
  • Jesse Mu
  • Noah Goodman

Generating step-by-step "chain-of-thought" rationales improves language model performance on complex reasoning tasks like mathematics or commonsense question-answering. However, inducing language model rationale generation currently requires either constructing massive rationale datasets or sacrificing accuracy by using only few-shot inference. We propose a technique to iteratively leverage a small number of rationale examples and a large dataset without rationales, to bootstrap the ability to perform successively more complex reasoning. This technique, the "Self-Taught Reasoner" (STaR), relies on a simple loop: generate rationales to answer many questions, prompted with a few rationale examples; if the generated answers are wrong, try again to generate a rationale given the correct answer; fine-tune on all the rationales that ultimately yielded correct answers; repeat. We show that STaR significantly improves performance on multiple datasets compared to a model fine-tuned to directly predict final answers, and performs comparably to fine-tuning a 30$\times$ larger state-of-the-art language model on CommensenseQA. Thus, STaR lets a model improve itself by learning from its own generated reasoning.

NeurIPS Conference 2021 Conference Paper

Contrastive Reinforcement Learning of Symbolic Reasoning Domains

  • Gabriel Poesia
  • WenXin Dong
  • Noah Goodman

Abstract symbolic reasoning, as required in domains such as mathematics and logic, is a key component of human intelligence. Solvers for these domains have important applications, especially to computer-assisted education. But learning to solve symbolic problems is challenging for machine learning algorithms. Existing models either learn from human solutions or use hand-engineered features, making them expensive to apply in new domains. In this paper, we instead consider symbolic domains as simple environments where states and actions are given as unstructured text, and binary rewards indicate whether a problem is solved. This flexible setup makes it easy to specify new domains, but search and planning become challenging. We introduce five environments inspired by the Mathematics Common Core Curriculum, and observe that existing Reinforcement Learning baselines perform poorly. We then present a novel learning algorithm, Contrastive Policy Learning (ConPoLe) that explicitly optimizes the InfoNCE loss, which lower bounds the mutual information between the current state and next states that continue on a path to the solution. ConPoLe successfully solves all four domains. Moreover, problem representations learned by ConPoLe enable accurate prediction of the categories of problems in a real mathematics curriculum. Our results suggest new directions for reinforcement learning in symbolic domains, as well as applications to mathematics education.

NeurIPS Conference 2021 Conference Paper

DABS: a Domain-Agnostic Benchmark for Self-Supervised Learning

  • Alex Tamkin
  • Vincent Liu
  • Rongfei Lu
  • Daniel Fein
  • Colin Schultz
  • Noah Goodman

Self-supervised learning algorithms, including BERT and SimCLR, have enabled significant strides in fields like natural language processing, computer vision, and speech processing. However, these algorithms are domain-specific, meaning that new self-supervised learning algorithms must be developed for each new setting, including myriad healthcare, scientific, and multimodal domains. To catalyze progress toward domain-agnostic methods, we introduce DABS: a Domain-Agnostic Benchmark for Self-supervised learning. To perform well on DABS, an algorithm is evaluated on seven diverse domains: natural images, multichannel sensor data, English text, speech recordings, multilingual text, chest x-rays, and images with text descriptions. Each domain contains an unlabeled dataset for pretraining; the model is then is scored based on its downstream performance on a set of labeled tasks in the domain. We also present e-Mix and ShED: two baseline domain-agnostic algorithms; their relatively modest performance demonstrates that significant progress is needed before self-supervised learning is an out-of-the-box solution for arbitrary domains. Code for benchmark datasets and baseline algorithms is available at https: //github. com/alextamkin/dabs.

NeurIPS Conference 2021 Conference Paper

Emergent Communication of Generalizations

  • Jesse Mu
  • Noah Goodman

To build agents that can collaborate effectively with others, recent research has trained artificial agents to communicate with each other in Lewis-style referential games. However, this often leads to successful but uninterpretable communication. We argue that this is due to the game objective: communicating about a single object in a shared visual context is prone to overfitting and does not encourage language useful beyond concrete reference. In contrast, human language conveys a rich variety of abstract ideas. To promote such skills, we propose games that require communicating generalizations over sets of objects representing abstract visual concepts, optionally with separate contexts for each agent. We find that these games greatly improve systematicity and interpretability of the learned languages, according to several metrics in the literature. Finally, we propose a method for identifying logical operations embedded in the emergent languages by learning an approximate compositional reconstruction of the language.

NeurIPS Conference 2021 Conference Paper

Improving Compositionality of Neural Networks by Decoding Representations to Inputs

  • Mike Wu
  • Noah Goodman
  • Stefano Ermon

In traditional software programs, it is easy to trace program logic from variables back to input, apply assertion statements to block erroneous behavior, and compose programs together. Although deep learning programs have demonstrated strong performance on novel applications, they sacrifice many of the functionalities of traditional software programs. With this as motivation, we take a modest first step towards improving deep learning programs by jointly training a generative model to constrain neural network activations to "decode" back to inputs. We call this design a Decodable Neural Network, or DecNN. Doing so enables a form of compositionality in neural networks, where one can recursively compose DecNN with itself to create an ensemble-like model with uncertainty. In our experiments, we demonstrate applications of this uncertainty to out-of-distribution detection, adversarial example detection, and calibration --- while matching standard neural networks in accuracy. We further explore this compositionality by combining DecNN with pretrained models, where we show promising results that neural networks can be regularized from using protected features.

AAAI Conference 2021 Conference Paper

Pragmatic Code Autocomplete

  • Gabriel Poesia
  • Noah Goodman

Human language is ambiguous, with intended meanings recovered via pragmatic reasoning in context. Such reliance on context is essential for the efficiency of human communication. Programming languages, in stark contrast, are defined by unambiguous grammars. In this work, we aim to make programming languages more concise by allowing programmers to utilize a controlled level of ambiguity. Specifically, we allow single-character abbreviations for common keywords and identifiers. Our system first proposes a set of strings that can be abbreviated by the user. Using only 100 abbreviations, we observe that a corpus of Python code can be compressed by 15%, a number that can be improved even further by specializing the abbreviations to a particular code base. We then use a contextualized sequence-to-sequence model to rank potential expansions of inputs that include abbreviations. In an offline reconstruction task our model achieves accuracies ranging from 93% to 99%, depending on the programming language and user settings. The model is small enough to run on a commodity CPU in real-time. We evaluate the usability of our system in a user study, integrating it in Microsoft VS- Code, a popular code text editor. We observe that our system performs well and is complementary to traditional autocomplete features.

NeurIPS Conference 2020 Conference Paper

Language Through a Prism: A Spectral Approach for Multiscale Language Representations

  • Alex Tamkin
  • Dan Jurafsky
  • Noah Goodman

Language exhibits structure at a wide range of scales, from subwords to words, sentences, paragraphs, and documents. We propose building models that isolate scale-specific information in deep representations, and develop methods for encouraging models during training to learn more about particular scales of interest. Our method for creating scale-specific neurons in deep NLP models constrains how the activation of a neuron can change across the tokens of an input by interpreting those activations as a digital signal and filtering out parts of its frequency spectrum. This technique enables us to extract scale-specific information from BERT representations: by filtering out different frequencies we can produce new representations that perform well on part of speech tagging (word-level), dialog speech acts classification (utterance-level), or topic classification (document-level), while performing poorly on the other tasks. We also present a prism layer for use during training, which constrains different neurons of a BERT model to different parts of the frequency spectrum. Our proposed BERT + Prism model is better able to predict masked tokens using long-range context, and produces individual multiscale representations that perform with comparable or improved performance across all three tasks. Our methods are general and readily applicable to other domains besides language, such as images, audio, and video.

AAAI Conference 2020 Conference Paper

Meta-Amortized Variational Inference and Learning

  • Mike Wu
  • Kristy Choi
  • Noah Goodman
  • Stefano Ermon

Despite the recent success in probabilistic modeling and their applications, generative models trained using traditional inference techniques struggle to adapt to new distributions, even when the target distribution may be closely related to the ones seen during training. In this work, we present a doublyamortized variational inference procedure as a way to address this challenge. By sharing computation across not only a set of query inputs, but also a set of different, related probabilistic models, we learn transferable latent representations that generalize across several related distributions. In particular, given a set of distributions over images, we find the learned representations to transfer to different data transformations. We empirically demonstrate the effectiveness of our method by introducing the MetaVAE, and show that it significantly outperforms baselines on downstream image classification tasks on MNIST (10-50%) and NORB (10-35%).

NeurIPS Conference 2019 Conference Paper

Variational Bayesian Optimal Experimental Design

  • Adam Foster
  • Martin Jankowiak
  • Elias Bingham
  • Paul Horsfall
  • Yee Whye Teh
  • Thomas Rainforth
  • Noah Goodman

Bayesian optimal experimental design (BOED) is a principled framework for making efficient use of limited experimental resources. Unfortunately, its applicability is hampered by the difficulty of obtaining accurate estimates of the expected information gain (EIG) of an experiment. To address this, we introduce several classes of fast EIG estimators by building on ideas from amortized variational inference. We show theoretically and empirically that these estimators can provide significant gains in speed and accuracy over previous approaches. We further demonstrate the practicality of our approach on a number of end-to-end experiments.

AAAI Conference 2019 Conference Paper

Zero Shot Learning for Code Education: Rubric Sampling with Deep Learning Inference

  • Mike Wu
  • Milan Mosse
  • Noah Goodman
  • Chris Piech

In modern computer science education, massive open online courses (MOOCs) log thousands of hours of data about how students solve coding challenges. Being so rich in data, these platforms have garnered the interest of the machine learning community, with many new algorithms attempting to autonomously provide feedback to help future students learn. But what about those first hundred thousand students? In most educational contexts (i. e. classrooms), assignments do not have enough historical data for supervised learning. In this paper, we introduce a human-in-the-loop “rubric sampling” approach to tackle the “zero shot” feedback challenge. We are able to provide autonomous feedback for the first students working on an introductory programming assignment with accuracy that substantially outperforms data-hungry algorithms and approaches human level fidelity. Rubric sampling requires minimal teacher effort, can associate feedback with specific parts of a student’s solution and can articulate a student’s misconceptions in the language of the instructor. Deep learning inference enables rubric sampling to further improve as more assignment specific student data is acquired. We demonstrate our results on a novel dataset from Code. org, the world’s largest programming education platform.

NeurIPS Conference 2018 Conference Paper

Bias and Generalization in Deep Generative Models: An Empirical Study

  • Shengjia Zhao
  • Hongyu Ren
  • Arianna Yuan
  • Jiaming Song
  • Noah Goodman
  • Stefano Ermon

In high dimensional settings, density estimation algorithms rely crucially on their inductive bias. Despite recent empirical success, the inductive bias of deep generative models is not well understood. In this paper we propose a framework to systematically investigate bias and generalization in deep generative models of images by probing the learning algorithm with carefully designed training datasets. By measuring properties of the learned distribution, we are able to find interesting patterns of generalization. We verify that these patterns are consistent across datasets, common models and architectures.

NeurIPS Conference 2018 Conference Paper

Multimodal Generative Models for Scalable Weakly-Supervised Learning

  • Mike Wu
  • Noah Goodman

Multiple modalities often co-occur when describing natural phenomena. Learning a joint representation of these modalities should yield deeper and more useful representations. Previous generative approaches to multi-modal input either do not learn a joint distribution or require additional computation to handle missing data. Here, we introduce a multimodal variational autoencoder (MVAE) that uses a product-of-experts inference network and a sub-sampled training paradigm to solve the multi-modal inference problem. Notably, our model shares parameters to efficiently learn under any combination of missing modalities. We apply the MVAE on four datasets and match state-of-the-art performance using many fewer parameters. In addition, we show that the MVAE is directly applicable to weakly-supervised learning, and is robust to incomplete supervision. We then consider two case studies, one of learning image transformations---edge detection, colorization, segmentation---as a set of modalities, followed by one of machine translation between two languages. We find appealing results across this range of tasks.

NeurIPS Conference 2017 Conference Paper

Learning Disentangled Representations with Semi-Supervised Deep Generative Models

  • Siddharth N
  • Brooks Paige
  • Jan-Willem van de Meent
  • Alban Desmaison
  • Noah Goodman
  • Pushmeet Kohli
  • Frank Wood
  • Philip Torr

Variational autoencoders (VAEs) learn representations of data by jointly training a probabilistic encoder and decoder network. Typically these models encode all features of the data into a single variable. Here we are interested in learning disentangled representations that encode distinct aspects of the data into separate variables. We propose to learn such representations using model architectures that generalise from standard VAEs, employing a general graphical model structure in the encoder and decoder. This allows us to train partially-specified models that make relatively strong assumptions about a subset of interpretable variables and rely on the flexibility of neural networks to learn representations for the remaining variables. We further define a general objective for semi-supervised learning in this model class, which can be approximated using an importance sampling procedure. We evaluate our framework's ability to learn disentangled representations, both by qualitative exploration of its generative capacity, and quantitative evaluation of its discriminative ability on a variety of models and datasets.

AAAI Conference 2016 Conference Paper

Learning the Preferences of Ignorant, Inconsistent Agents

  • Owain Evans
  • Andreas Stuhlmueller
  • Noah Goodman

An important use of machine learning is to learn what people value. What posts or photos should a user be shown? Which jobs or activities would a person find rewarding? In each case, observations of people’s past choices can inform our inferences about their likes and preferences. If we assume that choices are approximately optimal according to some utility function, we can treat preference inference as Bayesian inverse planning. That is, given a prior on utility functions and some observed choices, we invert an optimal decision-making process to infer a posterior distribution on utility functions. However, people often deviate from approximate optimality. They have false beliefs, their planning is sub-optimal, and their choices may be temporally inconsistent due to hyperbolic discounting and other biases. We demonstrate how to incorporate these deviations into algorithms for preference inference by constructing generative models of planning for agents who are subject to false beliefs and time inconsistency. We explore the inferences these models make about preferences, beliefs, and biases. We present a behavioral experiment in which human subjects perform preference inference given the same observations of choices as our model. Results show that human subjects (like our model) explain choices in terms of systematic deviations from optimal behavior and suggest that they take such deviations into account when inferring preferences.

NeurIPS Conference 2016 Conference Paper

Neurally-Guided Procedural Models: Amortized Inference for Procedural Graphics Programs using Neural Networks

  • Daniel Ritchie
  • Anna Thomas
  • Pat Hanrahan
  • Noah Goodman

Probabilistic inference algorithms such as Sequential Monte Carlo (SMC) provide powerful tools for constraining procedural models in computer graphics, but they require many samples to produce desirable results. In this paper, we show how to create procedural models which learn how to satisfy constraints. We augment procedural models with neural networks which control how the model makes random choices based on the output it has generated thus far. We call such models neurally-guided procedural models. As a pre-computation, we train these models to maximize the likelihood of example outputs generated via SMC. They are then used as efficient SMC importance samplers, generating high-quality results with very few samples. We evaluate our method on L-system-like models with image-based constraints. Given a desired quality threshold, neurally-guided models can generate satisfactory results up to 10x faster than unguided models.

NeurIPS Conference 2013 Conference Paper

Learning and using language via recursive pragmatic reasoning about other agents

  • Nathaniel Smith
  • Noah Goodman
  • Michael Frank

Language users are remarkably good at making inferences about speakers' intentions in context, and children learning their native language also display substantial skill in acquiring the meanings of unknown words. These two cases are deeply related: Language users invent new terms in conversation, and language learners learn the literal meanings of words based on their pragmatic inferences about how those words are used. While pragmatic inference and word learning have both been independently characterized in probabilistic terms, no current work unifies these two. We describe a model in which language learners assume that they jointly approximate a shared, external lexicon and reason recursively about the goals of others in using this lexicon. This model captures phenomena in word learning and pragmatic inference; it additionally leads to insights about the emergence of communicative systems in conversation and the mechanisms by which pragmatic inferences become incorporated into word meanings.

NeurIPS Conference 2013 Conference Paper

Learning Stochastic Inverses

  • Andreas Stuhlmüller
  • Jacob Taylor
  • Noah Goodman

We describe a class of algorithms for amortized inference in Bayesian networks. In this setting, we invest computation upfront to support rapid online inference for a wide range of queries. Our approach is based on learning an inverse factorization of a model's joint distribution: a factorization that turns observations into root nodes. Our algorithms accumulate information to estimate the local conditional distributions that constitute such a factorization. These stochastic inverses can be used to invert each of the computation steps leading to an observation, sampling backwards in order to quickly find a likely explanation. We show that estimated inverses converge asymptotically in number of (prior or posterior) training samples. To make use of inverses before convergence, we describe the Inverse MCMC algorithm, which uses stochastic inverses to make block proposals for a Metropolis-Hastings sampler. We explore the efficiency of this sampler for a variety of parameter regimes and Bayes nets.

NeurIPS Conference 2012 Conference Paper

Burn-in, bias, and the rationality of anchoring

  • Falk Lieder
  • Tom Griffiths
  • Noah Goodman

Bayesian inference provides a unifying framework for addressing problems in machine learning, artificial intelligence, and robotics, as well as the problems facing the human mind. Unfortunately, exact Bayesian inference is intractable in all but the simplest models. Therefore minds and machines have to approximate Bayesian inference. Approximate inference algorithms can achieve a wide range of time-accuracy tradeoffs, but what is the optimal tradeoff? We investigate time-accuracy tradeoffs using the Metropolis-Hastings algorithm as a metaphor for the mind's inference algorithm(s). We find that reasonably accurate decisions are possible long before the Markov chain has converged to the posterior distribution, i. e. during the period known as burn-in. Therefore the strategy that is optimal subject to the mind's bounded processing speed and opportunity costs may perform so few iterations that the resulting samples are biased towards the initial value. The resulting cognitive process model provides a rational basis for the anchoring-and-adjustment heuristic. The model's quantitative predictions are tested against published data on anchoring in numerical estimation tasks. Our theoretical and empirical results suggest that the anchoring bias is consistent with approximate Bayesian inference.

NeurIPS Conference 2011 Conference Paper

Nonstandard Interpretations of Probabilistic Programs for Efficient Inference

  • David Wingate
  • Noah Goodman
  • Andreas Stuhlmueller
  • Jeffrey Siskind

Probabilistic programming languages allow modelers to specify a stochastic process using syntax that resembles modern programming languages. Because the program is in machine-readable format, a variety of techniques from compiler design and program analysis can be used to examine the structure of the distribution represented by the probabilistic program. We show how nonstandard interpretations of probabilistic programs can be used to craft efficient inference algorithms: information about the structure of a distribution (such as gradients or dependencies) is generated as a monad-like side computation while executing the program. These interpretations can be easily coded using special-purpose objects and operator overloading. We implement two examples of nonstandard interpretations in two different languages, and use them as building blocks to construct inference algorithms: automatic differentiation, which enables gradient based methods, and provenance tracking, which enables efficient construction of global proposals.

NeurIPS Conference 2009 Conference Paper

Help or Hinder: Bayesian Models of Social Goal Inference

  • Tomer Ullman
  • Chris Baker
  • Owen Macindoe
  • Owain Evans
  • Noah Goodman
  • Joshua Tenenbaum

Everyday social interactions are heavily influenced by our snap judgments about others goals. Even young infants can infer the goals of intentional agents from observing how they interact with objects and other agents in their environment: e. g. , that one agent is helping or hindering anothers attempt to get up a hill or open a box. We propose a model for how people can infer these social goals from actions, based on inverse planning in multiagent Markov decision problems (MDPs). The model infers the goal most likely to be driving an agents behavior by assuming the agent acts approximately rationally given environmental constraints and its model of other agents present. We also present behavioral evidence in support of this model over a simpler, perceptual cue-based alternative.

NeurIPS Conference 2007 Conference Paper

A Bayesian Framework for Cross-Situational Word-Learning

  • Noah Goodman
  • Joshua Tenenbaum
  • Michael Black

For infants, early word learning is a chicken-and-egg problem. One way to learn a word is to observe that it co-occurs with a particular referent across different situations. Another way is to use the social context of an utterance to infer the in- tended referent of a word. Here we present a Bayesian model of cross-situational word learning, and an extension of this model that also learns which social cues are relevant to determining reference. We test our model on a small corpus of mother-infant interaction and find it performs better than competing models. Fi- nally, we show that our model accounts for experimental phenomena including mutual exclusivity, fast-mapping, and generalization from social cues. To understand the difficulty of an infant word-learner, imagine walking down the street with a friend who suddenly says “dax blicket philbin na fivy! ” while at the same time wagging her elbow. If you knew any of these words you might infer from the syntax of her sentence that blicket is a novel noun, and hence the name of a novel object. At the same time, if you knew that this friend indicated her attention by wagging her elbow at objects, you might infer that she intends to refer to an object in a nearby show window. On the other hand if you already knew that “blicket” meant the object in the window, you might be able to infer these elements of syntax and social cues. Thus, the problem of early word-learning is a classic chicken-and-egg puzzle: in order to learn word meanings, learners must use their knowledge of the rest of language (including rules of syntax, parts of speech, and other word meanings) as well as their knowledge of social situations. But in order to learn about the facts of their language they must first learn some words, and in order to determine which cues matter for establishing reference (for instance, pointing and looking at an object but normally not waggling your elbow) they must first have a way to know the intended referent in some situations. For theories of language acquisition, there are two common ways out of this dilemma. The first involves positing a wide range of innate structures which determine the syntax and categories of a language and which social cues are informative. (Though even when all of these elements are innately determined using them to learn a language from evidence may not be trivial [1]. ) The other alternative involves bootstrapping: learning some words, then using those words to learn how to learn more. This paper gives a proposal for the second alternative. We first present a Bayesian model of how learners could use a statistical strategy—cross-situational word-learning—to learn how words map to objects, independent of syntactic and social cues. We then extend this model to a true bootstrapping situation: using social cues to learn words while using words to learn social cues. Finally, we examine several important phenomena in word learning: mutual exclusivity (the tendency to assign novel words to novel referents), fast-mapping (the ability to assign a novel word in a linguistic context to a novel referent after only a single use), and social generalization (the ability to use social context to learn the referent of a novel word). Without adding additional specialized machinery, we show how these can be explained within our model as the result of domain-general probabilistic inference mechanisms operating over the linguistic domain. 1 Figure 1: Graphical model de- scribing the generation of words (Ws) from an intention (Is) and lexicon ((cid: 96)), and intention from the objects present in a situa- tion (Os). The plate indicates multiple copies of the model for different situation/utterance pairs (s). Dotted portions indicate ad- ditions to include the generation of social cues Ss from intentions.

NeurIPS Conference 2007 Conference Paper

Learning and using relational theories

  • Charles Kemp
  • Noah Goodman
  • Joshua Tenenbaum

Much of human knowledge is organized into sophisticated systems that are often called intuitive theories. We propose that intuitive theories are mentally repre- sented in a logical language, and that the subjective complexity of a theory is determined by the length of its representation in this language. This complexity measure helps to explain how theories are learned from relational data, and how they support inductive inferences about unobserved relations. We describe two experiments that test our approach, and show that it provides a better account of human learning and reasoning than an approach developed by Goodman [1]. What is a theory, and what makes one theory better than another? Questions like these are of obvious interest to philosophers of science but are also discussed by psychologists, who have argued that everyday knowledge is organized into rich and complex systems that are similar in many respects to scientific theories. Even young children, for instance, have systematic beliefs about domains including folk physics, folk biology, and folk psychology [2]. Intuitive theories like these play many of the same roles as scientific theories: in particular, both kinds of theories are used to explain and encode observations of the world, and to predict future observations. This paper explores the nature, use and acquisition of simple theories. Consider, for instance, an anthropologist who has just begun to study the social structure of a remote tribe, and observes that certain words are used to indicate relationships between selected pairs of individuals. Suppose that term T1(·, ·) can be glossed as ancestor(·, ·), and that T2(·, ·) can be glossed as friend(·, ·). The anthropologist might discover that the first term is transitive, and that the second term is symmetric with a few exceptions. Suppose that term T3(·, ·) can be glossed as defers to(·, ·), and that the tribe divides into two castes such that members of the second caste defer to members of the first caste. In this case the anthropologist might discover two latent concepts (caste 1(·) and caste 2(·)) along with the relationship between these concepts. As these examples suggest, a theory can be defined as a system of laws and concepts that specify the relationships between the elements in some domain [2]. We will consider how these theories are learned, how they are used to encode relational data, and how they support predictions about unob- served relations. Our approach to all three problems relies on the notion of subjective complexity. We propose that theory learners prefer simple theories, that people remember relational data in terms of the simplest underlying theory, and that people extend a partially observed data set according to the simplest theory that is consistent with their observations. There is no guarantee that a single measure of subjective complexity can do all of the work that we require [3]. This paper, however, explores the strong hypothesis that a single measure will suffice. Our formal treatment of subjective complexity begins with the question of how theories are mentally represented. We suggest that theories are represented in some logical language, and propose a spe- cific first-order language that serves as a hypothesis about the “language of thought. ” We then pursue the idea that the subjective complexity of a theory corresponds to the length of its representation in this language. Our approach therefore builds on the work of Feldman [4], and is related to other psychological applications of the notion of Kolmogorov complexity [5]. The complexity measure we describe can be used to define a probability distribution over a space of theories, and we develop a model of theory acquisition by using this distribution as the prior for a Bayesian learner. We also