Arrow Research search

Author name cluster

Jacob Eisenstein

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

14 papers
2 author rows

Possible papers

14

TMLR Journal 2025 Journal Article

ALTA: Compiler-Based Analysis of Transformers

  • Peter Shaw
  • James Cohan
  • Jacob Eisenstein
  • Kenton Lee
  • Jonathan Berant
  • Kristina Toutanova

We propose a new programming language called ALTA and a compiler that can map ALTA programs to Transformer weights. ALTA is inspired by RASP, a language proposed by Weiss et al. (2021), and Tracr (Lindner et al., 2023), a compiler from RASP programs to Transformer weights. ALTA complements and extends this prior work, offering the ability to express loops and to compile programs to Universal Transformers, among other advantages. ALTA allows us to constructively show how Transformers can represent length-invariant algorithms for computing parity and addition, as well as a solution to the SCAN benchmark of compositional generalization tasks, without requiring intermediate scratchpad decoding steps. We also propose tools to analyze cases where the expressibility of an algorithm is established, but end-to-end training on a given training set fails to induce behavior consistent with the desired algorithm. To this end, we explore training from ALTA execution traces as a more fine-grained supervision signal. This enables additional experiments and theoretical analyses relating the learnability of various algorithms to data availability and modeling decisions, such as positional encodings. We make the ALTA framework --- language specification, symbolic interpreter, and weight compiler --- available to the community to enable further applications and insights.

ICML Conference 2025 Conference Paper

InfAlign: Inference-aware language model alignment

  • Ananth Balashankar
  • Ziteng Sun
  • Jonathan Berant
  • Jacob Eisenstein
  • Michael Collins 0001
  • Adrian Hutter
  • Jong Lee
  • Chirag Nagpal

Language model alignment is a critical step in training modern generative language models. Alignment targets to improve win rate of a sample from the aligned model against the base model. Today, we are increasingly using inference-time algorithms (e. g. , Best-of-$N$, controlled decoding, tree search) to decode from language models rather than standard sampling. We show that this train/test mismatch makes standard RLHF framework sub-optimal in view of such inference-time methods. To this end, we propose a framework for inference-aware alignment (InfAlign), which aims to optimize inference-time win rate of the aligned policy against the base model. We prove that for any inference-time decoding procedure, the optimal aligned policy is the solution to the standard RLHF problem with a transformation of the reward. This motivates us to provide the calibrate-and-transform RL (InfAlign-CTRL) algorithm to solve this problem, which involves a reward calibration step and a KL-regularized reward maximization step with a transformation of the calibrated reward. For best-of-$N$ sampling and best-of-$N$ jailbreaking, we propose specific transformations offering up to 3-8% improvement on inference-time win rates. Finally, we also show that our proposed reward calibration method is a strong baseline for optimizing standard win rate.

ICLR Conference 2025 Conference Paper

Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning

  • Amrith Setlur
  • Chirag Nagpal
  • Adam Fisch
  • Xinyang Geng
  • Jacob Eisenstein
  • Rishabh Agarwal
  • Alekh Agarwal
  • Jonathan Berant

A promising approach for improving reasoning in large language models is to use process reward models (PRMs). PRMs provide feedback at each step of a multi-step reasoning trace, improving credit assignment over outcome reward models (ORMs) that only provide feedback at the final step. However, collecting dense, per-step human labels is not scalable, and training PRMs from automatically-labeled data has thus far led to limited gains. With the goal of using PRMs to improve a *base* policy via test-time search and reinforcement learning (RL), we ask: ``How should we design process rewards?'' Our key insight is that, to be effective, the process reward for a step should measure *progress*: a change in the likelihood of producing a correct response in the future, before and after taking the step, as measured under a *prover* policy distinct from the base policy. Such progress values can {distinguish} good and bad steps generated by the base policy, even though the base policy itself cannot. Theoretically, we show that even weaker provers can improve the base policy, as long as they distinguish steps without being too misaligned with the base policy. Our results show that process rewards defined as progress under such provers improve the efficiency of exploration during test-time search and online RL. We empirically validate our claims by training **process advantage verifiers (PAVs)** to measure progress under such provers and show that compared to ORM, they are >8% more accurate, and 1.5-5x more compute-efficient. Equipped with these insights, our PAVs enable **one of the first results** showing a 6x gain in sample efficiency for a policy trained using online RL with PRMs vs. ORMs.

TMLR Journal 2025 Journal Article

Robust Preference Optimization through Reward Model Distillation

  • Adam Fisch
  • Jacob Eisenstein
  • Vicky Zayats
  • Alekh Agarwal
  • Ahmad Beirami
  • Chirag Nagpal
  • Peter Shaw
  • Jonathan Berant

Language model (LM) post-training (or alignment) involves maximizing a reward function that is derived from preference annotations. Direct Preference Optimization (DPO) is a popular offline alignment method that trains a policy directly on preference data without the need to train a reward model or apply reinforcement learning. However, the empirical evidence suggests that DPO typically assigns implicit rewards that overfit, and trend towards infinite magnitude. This frequently leads to degenerate policies, sometimes causing even the probabilities of the preferred generations to go to zero. In this work, we analyze this phenomenon and use distillation to get a better proxy for the true preference distribution over generation pairs: we train the LM such that its induced implicit reward, i.e., the scaled log-likelihood ratio of the model to the reference model, matches an explicit reward model trained on the preference data. Moreover, to account for uncertainty in the reward model we are distilling from, we optimize against a family of reward models that, as a whole, is likely to include at least one reasonable proxy for the preference distribution. Our results show that distilling from such a family of reward models leads to improved robustness to distribution shift in preference annotations, while preserving the simple supervised nature of DPO.

ICML Conference 2025 Conference Paper

Theoretical guarantees on the best-of-n alignment policy

  • Ahmad Beirami
  • Alekh Agarwal
  • Jonathan Berant
  • Alexander Nicholas D'Amour
  • Jacob Eisenstein
  • Chirag Nagpal
  • Ananda Theertha Suresh

A simple and effective method for the inference-time alignment of generative models is the best-of-$n$ policy, where $n$ samples are drawn from a reference policy, ranked based on a reward function, and the highest ranking one is selected. A commonly used analytical expression in the literature claims that the KL divergence between the best-of-$n$ policy and the reference policy is equal to $\log (n) - (n-1)/n. $ We disprove the validity of this claim, and show that it is an upper bound on the actual KL divergence. We also explore the tightness of this upper bound in different regimes, and propose a new estimator for the KL divergence and empirically show that it provides a tight approximation. We also show that the win rate of the best-of-$n$ policy against the reference policy is upper bounded by $n/(n+1)$ and derive bounds on the tightness of this characterization. We conclude with analyzing the tradeoffs between win rate and KL divergence of the best-of-$n$ alignment policy, which demonstrate that very good tradeoffs are achievable with $n < 1000$.

ICML Conference 2024 Conference Paper

Transforming and Combining Rewards for Aligning Large Language Models

  • Zihao Wang
  • Chirag Nagpal
  • Jonathan Berant
  • Jacob Eisenstein
  • Alexander Nicholas D'Amour
  • Sanmi Koyejo
  • Victor Veitch

A common approach for aligning language models to human preferences is to first learn a reward model from preference data, and then use this reward model to update the language model. We study two closely related problems that arise in this approach. First, any monotone transformation of the reward model preserves preference ranking; is there a choice that is "better" than others? Second, we often wish to align language models to multiple properties: how should we combine multiple reward models? Using a probabilistic interpretation of the alignment procedure, we identify a natural choice for transformation for (the common case of) rewards learned from Bradley-Terry preference models. The derived transformation is straightforward: we apply a log-sigmoid function to the centered rewards, a method we term "LSC-transformation" (log-sigmoid-centered transformation). This transformation has two important properties. First, it emphasizes improving poorly-performing outputs, rather than outputs that already score well. This mitigates both underfitting (where some prompts are not improved) and reward hacking (where the model learns to exploit misspecification of the reward model). Second, it enables principled aggregation of rewards by linking summation to logical conjunction: the sum of transformed rewards corresponds to the probability that the output is "good" in all measured properties, in a sense we make precise. Experiments aligning language models to be both helpful and harmless using RLHF show substantial improvements over the baseline (non-transformed) approach.

ICLR Conference 2022 Conference Paper

The MultiBERTs: BERT Reproductions for Robustness Analysis

  • Thibault Sellam
  • Steve Yadlowsky
  • Ian Tenney
  • Jason Wei
  • Naomi Saphra
  • Alexander Nicholas D'Amour
  • Tal Linzen
  • Jasmijn Bastings

Experiments with pre-trained models such as BERT are often based on a single checkpoint. While the conclusions drawn apply to the artifact tested in the experiment (i.e., the particular instance of the model), it is not always clear whether they hold for the more general procedure which includes the architecture, training data, initialization scheme, and loss function. Recent work has shown that repeating the pre-training process can lead to substantially different performance, suggesting that an alternative strategy is needed to make principled statements about procedures. To enable researchers to draw more robust conclusions, we introduce MultiBERTs, a set of 25 BERT-Base checkpoints, trained with similar hyper-parameters as the original BERT model but differing in random weight initialization and shuffling of training data. We also define the Multi-Bootstrap, a non-parametric bootstrap method for statistical inference designed for settings where there are multiple pre-trained models and limited test data. To illustrate our approach, we present a case study of gender bias in coreference resolution, in which the Multi-Bootstrap lets us measure effects that may not be detected with a single checkpoint. The models and statistical library are available online, along with an additional set of 140 intermediate checkpoints captured during pre-training to facilitate research on learning dynamics.

JMLR Journal 2022 Journal Article

Underspecification Presents Challenges for Credibility in Modern Machine Learning

  • Alexander D'Amour
  • Katherine Heller
  • Dan Moldovan
  • Ben Adlam
  • Babak Alipanahi
  • Alex Beutel
  • Christina Chen
  • Jonathan Deaton

Machine learning (ML) systems often exhibit unexpectedly poor behavior when they are deployed in real-world domains. We identify underspecification in ML pipelines as a key reason for these failures. An ML pipeline is the full procedure followed to train and validate a predictor. Such a pipeline is underspecified when it can return many distinct predictors with equivalently strong test performance. Underspecification is common in modern ML pipelines that primarily validate predictors on held-out data that follow the same distribution as the training data. Predictors returned by underspecified pipelines are often treated as equivalent based on their training domain performance, but we show here that such predictors can behave very differently in deployment domains. This ambiguity can lead to instability and poor model behavior in practice, and is a distinct failure mode from previously identified issues arising from structural mismatch between training and deployment domains. We provide evidence that underspecfication has substantive implications for practical ML pipelines, using examples from computer vision, medical imaging, natural language processing, clinical risk prediction based on electronic health records, and medical genomics. Our results show the need to explicitly account for underspecification in modeling pipelines that are intended for real-world deployment in any domain. [abs] [ pdf ][ bib ] &copy JMLR 2022. ( edit, beta )

NeurIPS Conference 2021 Conference Paper

Counterfactual Invariance to Spurious Correlations in Text Classification

  • Victor Veitch
  • Alexander D'Amour
  • Steve Yadlowsky
  • Jacob Eisenstein

Informally, a 'spurious correlation' is the dependence of a model on some aspect of the input data that an analyst thinks shouldn't matter. In machine learning, these have a know-it-when-you-see-it character; e. g. , changing the gender of a sentence's subject changes a sentiment predictor's output. To check for spurious correlations, we can 'stress test' models by perturbing irrelevant parts of input data and seeing if model predictions change. In this paper, we study stress testing using the tools of causal inference. We introduce counterfactual invariance as a formalization of the requirement that changing irrelevant parts of the input shouldn't change model predictions. We connect counterfactual invariance to out-of-domain model performance, and provide practical schemes for learning (approximately) counterfactual invariant predictors (without access to counterfactual examples). It turns out that both the means and implications of counterfactual invariance depend fundamentally on the true underlying causal structure of the data---in particular, whether the label causes the features or the features cause the label. Distinct causal structures require distinct regularization schemes to induce counterfactual invariance. Similarly, counterfactual invariance implies different domain shift guarantees depending on the underlying causal structure. This theory is supported by empirical results on text classification.

AAAI Conference 2017 Conference Paper

Unsupervised Learning for Lexicon-Based Classification

  • Jacob Eisenstein

In lexicon-based classification, documents are assigned labels by comparing the number of words that appear from two opposed lexicons, such as positive and negative sentiment. Creating such words lists is often easier than labeling instances, and they can be debugged by non-experts if classification performance is unsatisfactory. However, there is little analysis or justification of this classification heuristic. This paper describes a set of assumptions that can be used to derive a probabilistic justification for lexicon-based classification, as well as an analysis of its expected accuracy. One key assumption behind lexicon-based classification is that all words in each lexicon are equally predictive. This is rarely true in practice, which is why lexicon-based approaches are usually outperformed by supervised classifiers that learn distinct weights on each word from labeled instances. This paper shows that it is possible to learn such weights without labeled data, by leveraging co-occurrence statistics across the lexicons. This offers the best of both worlds: light supervision in the form of lexicons, and data-driven classification with higher accuracy than traditional word-counting heuristics.

AAAI Conference 2008 Conference Paper

Discourse Topic and Gestural Form

  • Jacob Eisenstein

Coverbal gesture provides a channel for the visual expression of ideas. While some gestural emblems have culturally predefined forms (e. g. , “thumbs up”), the relationship between gesture and meaning is, in general, not conventionalized. It is natural to ask whether such gestures can be interpreted in a speaker-independent way, or whether gestural form is determined by the speaker’s idiosyncratic view of the discourse topic. We address this question using an audiovisual dataset across multiple speakers and topics. Our analysis employs a hierarchical Bayesian author-topic model, in which gestural patterns are stochastically generated by a mixture of speaker-specific and topic-specific priors. These gestural patterns are characterized using automaticallyextracted visual features, based on spatio-temporal interest points. This framework detects significant crossspeaker patterns in gesture that are governed by the discourse topic, suggesting that even unstructured gesticulation can be interpreted across speakers. In addition, the success of this approach shows that the semantic characteristics of gesture can be detected via a lowlevel, interest point representation.

AAAI Conference 2007 Conference Paper

Turning Lectures into Comic Books Using Linguistically Salient Gestures

  • Jacob Eisenstein

Creating video recordings of events such as lectures or meetings is increasingly inexpensive and easy. However, reviewing the content of such video may be time-consuming and difficult. Our goal is to produce a “comic book” summary, in which a transcript is augmented with keyframes that disambiguate and clarify accompanying text. Unlike most previous keyframe extraction systems which rely primarily on visual cues, we present a linguistically-motivated approach that selects keyframes that contain salient gestures. Rather than learning gesture salience directly, it is estimated by measuring the contribution of gesture to understanding other discourse phenomena. More specifically, we bootstrap from multimodal coreference resolution to identify gestures that improve performance. We then select keyframes that capture these gestures. Our model predicts gesture salience as a hidden variable in a conditional framework, with observable features from both the visual and textual modalities. This approach significantly outperforms competitive baselines that do not use gesture information.

AAAI Conference 1999 Short Paper

Learning Design Guidelines by Theory Refinement

  • Jacob Eisenstein
  • Stanford University

We add adaptation to an existing piece of automatic design software: the TIMM module of the MOBI-D user- interface design environment. MOBI-D maintains explicit, formal representations of the abstract and concrete sides of the interface. TIMM automates the mappings between the abstract domain objects and concrete presentation elements, using a decision tree.