Arrow Research search

Author name cluster

John Schulman

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

22 papers
2 author rows

Possible papers

22

NeurIPS Conference 2025 Conference Paper

Quantifying Elicitation of Latent Capabilities in Language Models

  • Elizabeth Donoway
  • Hailey Joren
  • Arushi Somani
  • Henry Sleight
  • Julian Michael
  • Michael Deweese
  • John Schulman
  • Ethan Perez

Large language models often possess latent capabilities that lie dormant unless explicitly elicited, or surfaced, through fine-tuning or prompt engineering. Predicting, assessing, and understanding these latent capabilities pose significant challenges in the development of effective, safe AI systems. In this work, we recast elicitation as an information-constrained fine-tuning problem and empirically characterize upper bounds on the minimal number of parameters needed to achieve specific task performances. We find that training as few as 10–100 randomly chosen parameters—several orders of magnitude fewer than state-of-the-art parameter-efficient methods—can recover up to 50% of the performance gap between pretrained-only and fully fine-tuned models, and 1,000s to 10,000s of parameters can recover 95% of this performance gap. We show that a logistic curve fits the relationship between the number of trained parameters and model performance gap recovery. This scaling generalizes across task formats and domains, as well as model sizes and families, extending to reasoning models and remaining robust to increases in inference compute. To help explain this behavior, we consider a simplified picture of elicitation via fine-tuning where each trainable parameter serves as an encoding mechanism for accessing task-specific knowledge. We observe a relationship between the number of trained parameters and how efficiently relevant model capabilities can be accessed and elicited, offering a potential route to distinguish elicitation from teaching.
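
As a rough illustration of the information-constrained fine-tuning setup described in the abstract, the sketch below trains only k randomly chosen scalar parameters by zeroing gradients everywhere else. It is a minimal, assumption-laden PyTorch sketch, not the authors' code; the helper names are invented for illustration.

```python
# Sketch: fine-tune only k randomly chosen scalar parameters by masking gradients
# (illustrative only; not the paper's implementation).
import torch

def make_sparse_masks(model, k, seed=0):
    """Pick k random scalar parameters across the whole model; return boolean masks."""
    gen = torch.Generator().manual_seed(seed)
    sizes = [p.numel() for p in model.parameters()]
    chosen = torch.randperm(sum(sizes), generator=gen)[:k]
    masks, offset = [], 0
    for p, n in zip(model.parameters(), sizes):
        local = chosen[(chosen >= offset) & (chosen < offset + n)] - offset
        mask = torch.zeros(n, dtype=torch.bool)
        mask[local] = True
        masks.append(mask.view(p.shape))
        offset += n
    return masks

def masked_step(model, masks, loss, optimizer):
    """One update in which only the chosen parameters receive gradient."""
    optimizer.zero_grad()
    loss.backward()
    with torch.no_grad():
        for p, m in zip(model.parameters(), masks):
            if p.grad is not None:
                p.grad.mul_(m.to(p.grad.dtype))  # zero out gradients outside the subset
    optimizer.step()
```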

ICLR Conference 2024 Conference Paper

Let's Verify Step by Step

  • Hunter Lightman
  • Vineet Kosaraju
  • Yuri Burda
  • Harrison Edwards
  • Bowen Baker
  • Teddy Lee
  • Jan Leike
  • John Schulman

In recent years, large language models have greatly improved in their ability to perform complex multi-step reasoning. However, even state-of-the-art models still regularly produce logical mistakes. To train more reliable models, we can turn either to outcome supervision, which provides feedback for a final result, or process supervision, which provides feedback for each intermediate reasoning step. Given the importance of training reliable models, and given the high cost of human feedback, it is important to carefully compare both methods. Recent work has already begun this comparison, but many questions still remain. We conduct our own investigation, finding that process supervision significantly outperforms outcome supervision for training models to solve problems from the challenging MATH dataset. Our process-supervised model solves 78% of problems from a representative subset of the MATH test set. Additionally, we show that active learning significantly improves the efficacy of process supervision. To support related research, we also release PRM800K, the complete dataset of 800,000 step-level human feedback labels used to train our best reward model.
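
To make the outcome-versus-process distinction concrete, here is a minimal sketch of how a process reward model's per-step scores can be aggregated into a solution-level score; the product-of-step-probabilities aggregation and the helper names are assumptions for illustration, not necessarily the paper's exact scoring rule.

```python
# Sketch: aggregate per-step correctness probabilities into a solution-level score
# (illustrative; the aggregation rule is an assumption).
from typing import Callable, List

def process_score(steps: List[str], step_prob: Callable[[str], float]) -> float:
    """Treats the solution as correct only if every step is correct, assuming
    independence: the product of per-step probabilities from the PRM (stand-in)."""
    score = 1.0
    for step in steps:
        score *= step_prob(step)
    return score

def outcome_score(final_answer: str, answer_prob: Callable[[str], float]) -> float:
    """Outcome supervision, by contrast, scores only the final result."""
    return answer_prob(final_answer)
```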

NeurIPS Conference 2024 Conference Paper

Rule Based Rewards for Language Model Safety

  • Tong Mu
  • Alec Helyar
  • Johannes Heidecke
  • Joshua Achiam
  • Andrea Vallone
  • Ian Kivlichan
  • Molly Lin
  • Alex Beutel

Reinforcement learning based fine-tuning of large language models (LLMs) on human preferences has been shown to enhance both their capabilities and safety behavior. However, in cases related to safety, without precise instructions to human annotators, the data collected may cause the model to become overly cautious, or to respond in an undesirable style, such as being judgmental. Additionally, as model capabilities and usage patterns evolve, there may be a costly need to add or relabel data to modify safety behavior. We propose a novel preference modeling approach that utilizes AI feedback and only requires a small amount of human data. Our method, Rule Based Rewards (RBR), uses a collection of rules for desired or undesired behaviors (e.g. refusals should not be judgmental) along with an LLM grader. In contrast to prior methods using AI feedback, our method uses fine-grained, composable, LLM-graded few-shot prompts as reward directly in RL training, resulting in greater control, accuracy and ease of updating. We show that RBRs are an effective training method, achieving an F1 score of 97.1, compared to a human-feedback baseline of 91.7, resulting in much higher safety-behavior accuracy through better balancing usefulness and safety.
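
The sketch below shows one way fine-grained, LLM-graded rule scores could be composed into a scalar reward and added to a reward-model score during RL training; the weighting scheme and helper signatures are assumptions, not OpenAI's implementation.

```python
# Sketch: compose LLM-graded rule scores into a rule-based reward term
# (illustrative; weights and grader interface are assumptions).
def rule_based_reward(completion, rules, grade, weights):
    """rules: natural-language propositions (e.g. "the refusal is not judgmental");
    grade(completion, rule) -> probability in [0, 1] from an LLM grader (hypothetical);
    weights: positive for desired behaviors, negative for undesired ones."""
    return sum(w * grade(completion, r) for r, w in zip(rules, weights))

def total_reward(completion, rm_score, rules, grade, weights):
    # Rule-based term composed with a conventional reward-model score.
    return rm_score + rule_based_reward(completion, rules, grade, weights)
```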

ICML Conference 2023 Conference Paper

Scaling Laws for Reward Model Overoptimization

  • Leo Gao
  • John Schulman
  • Jacob Hilton

In reinforcement learning from human feedback, it is common to optimize against a reward model trained to predict human preferences. Because the reward model is an imperfect proxy, optimizing its value too much can hinder ground truth performance, in accordance with Goodhart’s law. This effect has been frequently observed, but not carefully measured due to the expense of collecting human preference data. In this work, we use a synthetic setup in which a fixed “gold-standard” reward model plays the role of humans, providing labels used to train a proxy reward model. We study how the gold reward model score changes as we optimize against the proxy reward model using either reinforcement learning or best-of-$n$ sampling. We find that this relationship follows a different functional form depending on the method of optimization, and that in both cases its coefficients scale smoothly with the number of reward model parameters. We also study the effect on this relationship of the size of the reward model dataset, the number of reward model and policy parameters, and the coefficient of the KL penalty added to the reward in the reinforcement learning setup. We explore the implications of these empirical results for theoretical considerations in AI alignment.
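
A minimal sketch of the best-of-n side of this setup: select the sample the proxy reward model prefers, then score that selection with the gold reward model; sweeping n traces out the overoptimization curve. The function names and the averaging over trials are illustrative assumptions.

```python
# Sketch: measure gold reward of proxy-selected best-of-n samples (illustrative setup).
import random

def best_of_n_gold(sample, proxy_reward, gold_reward, n, trials=100, seed=0):
    """sample(rng) draws one completion; proxy_reward/gold_reward map it to a score.
    Returns the average gold score of the completion the proxy model prefers."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        candidates = [sample(rng) for _ in range(n)]
        best = max(candidates, key=proxy_reward)  # optimize against the proxy
        total += gold_reward(best)                # evaluate against the gold model
    return total / trials
```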

NeurIPS Conference 2022 Conference Paper

Batch size-invariance for policy optimization

  • Jacob Hilton
  • Karl Cobbe
  • John Schulman

We say an algorithm is batch size-invariant if changes to the batch size can largely be compensated for by changes to other hyperparameters. Stochastic gradient descent is well-known to have this property at small batch sizes, via the learning rate. However, some policy optimization algorithms (such as PPO) do not have this property, because of how they control the size of policy updates. In this work we show how to make these algorithms batch size-invariant. Our key insight is to decouple the proximal policy (used for controlling policy updates) from the behavior policy (used for off-policy corrections). Our experiments help explain why these algorithms work, and additionally show how they can make more efficient use of stale data.
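
The decoupling can be sketched as a PPO-style surrogate in which the clipped ratio is taken against a proximal policy while the off-policy correction uses the behavior policy that actually collected the data; this paraphrase and the tensor interface are my assumptions, not the paper's reference code.

```python
# Sketch: decoupled PPO-style surrogate (paraphrase of the idea, not the paper's code).
import torch

def decoupled_ppo_loss(logp, logp_prox, logp_behav, adv, clip_eps=0.2):
    """logp: log-probs under the current policy; logp_prox: under the proximal policy
    (e.g. an EMA of recent policies); logp_behav: under the data-collecting policy."""
    ratio_prox = torch.exp(logp - logp_prox)                  # controlled by clipping
    offpolicy_w = torch.exp(logp_prox - logp_behav).detach()  # off-policy correction
    unclipped = ratio_prox * adv
    clipped = torch.clamp(ratio_prox, 1 - clip_eps, 1 + clip_eps) * adv
    return -(offpolicy_w * torch.minimum(unclipped, clipped)).mean()
```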

NeurIPS Conference 2022 Conference Paper

Training language models to follow instructions with human feedback

  • Long Ouyang
  • Jeffrey Wu
  • Xu Jiang
  • Diogo Almeida
  • Carroll Wainwright
  • Pamela Mishkin
  • Chong Zhang
  • Sandhini Agarwal

Making language models bigger does not inherently make them better at following a user's intent. For example, large language models can generate outputs that are untruthful, toxic, or simply not helpful to the user. In other words, these models are not aligned with their users. In this paper, we show an avenue for aligning language models with user intent on a wide range of tasks by fine-tuning with human feedback. Starting with a set of labeler-written prompts and prompts submitted through a language model API, we collect a dataset of labeler demonstrations of the desired model behavior, which we use to fine-tune GPT-3 using supervised learning. We then collect a dataset of rankings of model outputs, which we use to further fine-tune this supervised model using reinforcement learning from human feedback. We call the resulting models InstructGPT. In human evaluations on our prompt distribution, outputs from the 1.3B parameter InstructGPT model are preferred to outputs from the 175B GPT-3, despite having 100x fewer parameters. Moreover, InstructGPT models show improvements in truthfulness and reductions in toxic output generation while having minimal performance regressions on public NLP datasets. Even though InstructGPT still makes simple mistakes, our results show that fine-tuning with human feedback is a promising direction for aligning language models with human intent.
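
The middle stage of this pipeline trains a reward model on human rankings; a standard pairwise formulation of that loss is sketched below (a generic RLHF-style loss, not necessarily the exact InstructGPT objective).

```python
# Sketch: pairwise ranking loss for a reward model trained on human comparisons.
import torch
import torch.nn.functional as F

def pairwise_rm_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """r_chosen / r_rejected: scalar rewards assigned to the preferred and dispreferred
    completions for the same prompt. Maximizes the log-sigmoid of the reward margin."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```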

ICML Conference 2021 Conference Paper

Phasic Policy Gradient

  • Karl Cobbe
  • Jacob Hilton
  • Oleg Klimov
  • John Schulman

We introduce Phasic Policy Gradient (PPG), a reinforcement learning framework which modifies traditional on-policy actor-critic methods by separating policy and value function training into distinct phases. In prior methods, one must choose between using a shared network or separate networks to represent the policy and value function. Using separate networks avoids interference between objectives, while using a shared network allows useful features to be shared. PPG is able to achieve the best of both worlds by splitting optimization into two phases, one that advances training and one that distills features. PPG also enables the value function to be more aggressively optimized with a higher level of sample reuse. Compared to PPO, we find that PPG significantly improves sample efficiency on the challenging Procgen Benchmark.
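
The phase structure can be sketched as alternating blocks: a policy phase of ordinary clipped-objective updates, then an auxiliary phase that reuses the stored data to distill value targets into the policy network under a behavioral-cloning constraint. The loop below is a paraphrase with invented helper names, not the reference implementation.

```python
# Sketch of the PPG two-phase loop (paraphrase; helpers are placeholders).
def ppg_iteration(collect_rollout, policy_update, value_update, aux_update,
                  n_policy=32, e_aux=6):
    buffer = []
    # Policy phase: standard on-policy actor-critic updates on fresh rollouts.
    for _ in range(n_policy):
        batch = collect_rollout()
        policy_update(batch)   # clipped policy objective
        value_update(batch)    # value-function regression
        buffer.append(batch)   # retained for the auxiliary phase
    # Auxiliary phase: aggressively reuse stored data to distill value features
    # into the policy network while a KL term keeps the policy near its behavior.
    for _ in range(e_aux):
        for batch in buffer:
            aux_update(batch)
```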

ICML Conference 2020 Conference Paper

Distribution Augmentation for Generative Modeling

  • Heewoo Jun
  • Rewon Child
  • Mark Chen 0003
  • John Schulman
  • Aditya Ramesh
  • Alec Radford
  • Ilya Sutskever

We present distribution augmentation (DistAug), a simple and powerful method of regularizing generative models. Our approach applies augmentation functions to data and, importantly, conditions the generative model on the specific function used. Unlike typical data augmentation, DistAug allows usage of functions which modify the target density, enabling aggressive augmentations more commonly seen in supervised and self-supervised learning. We demonstrate this is a more effective regularizer than standard methods, and use it to train a 152M parameter autoregressive model on CIFAR-10 to 2.56 bits per dim (relative to the state-of-the-art 2.80). Samples from this model attain FID 12.75 and IS 8.40, outperforming the majority of GANs. We further demonstrate the technique is broadly applicable across model architectures and problem domains.
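
A small sketch of the conditioning idea: each example is transformed by a randomly chosen augmentation and the transform's identity is returned so the model can be conditioned on it; the transform set and model interface here are assumptions.

```python
# Sketch: apply a random transform per example and expose its id for conditioning
# (illustrative; transform set and conditioning mechanism are assumptions).
import random
import torch

TRANSFORMS = {
    0: lambda x: x,                                    # identity
    1: lambda x: torch.flip(x, dims=[-1]),             # horizontal flip
    2: lambda x: torch.rot90(x, k=1, dims=[-2, -1]),   # 90-degree rotation
}

def distaug_batch(x: torch.Tensor):
    """x: (batch, channels, height, width). Returns transformed images and transform ids;
    the generative model would be conditioned on the id (e.g. via an extra embedding),
    and sampling conditions on the identity transform to target the original density."""
    ids = [random.randrange(len(TRANSFORMS)) for _ in range(x.shape[0])]
    xs = torch.stack([TRANSFORMS[i](img) for i, img in zip(ids, x)])
    return xs, torch.tensor(ids)
```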

ICML Conference 2020 Conference Paper

Leveraging Procedural Generation to Benchmark Reinforcement Learning

  • Karl Cobbe
  • Christopher Hesse
  • Jacob Hilton
  • John Schulman

We introduce Procgen Benchmark, a suite of 16 procedurally generated game-like environments designed to benchmark both sample efficiency and generalization in reinforcement learning. We believe that the community will benefit from increased access to high quality training environments, and we provide detailed experimental protocols for using this benchmark. We empirically demonstrate that diverse environment distributions are essential to adequately train and evaluate RL agents, thereby motivating the extensive use of procedural content generation. We then use this benchmark to investigate the effects of scaling model size, finding that larger models significantly improve both sample efficiency and generalization.

ICML Conference 2019 Conference Paper

Quantifying Generalization in Reinforcement Learning

  • Karl Cobbe
  • Oleg Klimov
  • Christopher Hesse
  • Tae-Hoon Kim
  • John Schulman

In this paper, we investigate the problem of overfitting in deep reinforcement learning. Among the most common benchmarks in RL, it is customary to use the same environments for both training and testing. This practice offers relatively little insight into an agent’s ability to generalize. We address this issue by using procedurally generated environments to construct distinct training and test sets. Most notably, we introduce a new environment called CoinRun, designed as a benchmark for generalization in RL. Using CoinRun, we find that agents overfit to surprisingly large training sets. We then show that deeper convolutional architectures improve generalization, as do methods traditionally found in supervised learning, including L2 regularization, dropout, data augmentation and batch normalization.

NeurIPS Conference 2017 Conference Paper

#Exploration: A Study of Count-Based Exploration for Deep Reinforcement Learning

  • Haoran Tang
  • Rein Houthooft
  • Davis Foote
  • Adam Stooke
  • OpenAI Xi Chen
  • Yan Duan
  • John Schulman
  • Filip DeTurck

Count-based exploration algorithms are known to perform near-optimally when used in conjunction with tabular reinforcement learning (RL) methods for solving small discrete Markov decision processes (MDPs). It is generally thought that count-based methods cannot be applied in high-dimensional state spaces, since most states will only occur once. Recent deep RL exploration strategies are able to deal with high-dimensional continuous state spaces through complex heuristics, often relying on optimism in the face of uncertainty or intrinsic motivation. In this work, we describe a surprising finding: a simple generalization of the classic count-based approach can reach near state-of-the-art performance on various high-dimensional and/or continuous deep RL benchmarks. States are mapped to hash codes, which allows their occurrences to be counted with a hash table. These counts are then used to compute a reward bonus according to the classic count-based exploration theory. We find that simple hash functions can achieve surprisingly good results on many challenging tasks. Furthermore, we show that a domain-dependent learned hash code may further improve these results. Detailed analysis reveals important aspects of a good hash function: 1) having appropriate granularity and 2) encoding information relevant to solving the MDP. This exploration strategy achieves near state-of-the-art performance on both continuous control tasks and Atari 2600 games, hence providing a simple yet powerful baseline for solving MDPs that require considerable exploration.
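
The counting scheme can be sketched with a SimHash-style random projection: states hash to short binary codes, a table counts code occurrences, and the bonus decays with the square root of the count. The code length and bonus coefficient below are placeholders.

```python
# Sketch: SimHash-style count-based exploration bonus (granularity k and beta are placeholders).
import numpy as np

class HashCounter:
    def __init__(self, state_dim, k=32, beta=0.01, seed=0):
        rng = np.random.default_rng(seed)
        self.A = rng.standard_normal((k, state_dim))  # random projection defining the hash
        self.counts = {}
        self.beta = beta

    def bonus(self, state):
        code = (self.A @ state > 0).tobytes()         # k-bit hash code of the state
        n = self.counts.get(code, 0) + 1
        self.counts[code] = n
        return self.beta / np.sqrt(n)                 # bonus added to the environment reward
```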

ICML Conference 2016 Conference Paper

Benchmarking Deep Reinforcement Learning for Continuous Control

  • Yan Duan
  • Xi Chen 0022
  • Rein Houthooft
  • John Schulman
  • Pieter Abbeel

Recently, researchers have made significant progress combining the advances in deep learning for learning feature representations with reinforcement learning. Some notable examples include training agents to play Atari games based on raw pixel data and to acquire advanced manipulation skills using raw sensory inputs. However, it has been difficult to quantify progress in the domain of continuous control due to the lack of a commonly adopted benchmark. In this work, we present a benchmark suite of continuous control tasks, including classic tasks like cart-pole swing-up, tasks with very high state and action dimensionality such as 3D humanoid locomotion, tasks with partial observations, and tasks with hierarchical structure. We report novel findings based on the systematic evaluation of a range of implemented reinforcement learning algorithms. Both the benchmark and reference implementations are released at https://github.com/rllab/rllab in order to facilitate experimental reproducibility and to encourage adoption by other researchers.

NeurIPS Conference 2016 Conference Paper

InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets

  • Xi Chen
  • Yan Duan
  • Rein Houthooft
  • John Schulman
  • Ilya Sutskever
  • Pieter Abbeel

This paper describes InfoGAN, an information-theoretic extension to the Generative Adversarial Network that is able to learn disentangled representations in a completely unsupervised manner. InfoGAN is a generative adversarial network that also maximizes the mutual information between a small subset of the latent variables and the observation. We derive a lower bound to the mutual information objective that can be optimized efficiently, and show that our training procedure can be interpreted as a variation of the Wake-Sleep algorithm. Specifically, InfoGAN successfully disentangles writing styles from digit shapes on the MNIST dataset, pose from lighting of 3D rendered images, and background digits from the central digit on the SVHN dataset. It also discovers visual concepts that include hair styles, presence/absence of eyeglasses, and emotions on the CelebA face dataset. Experiments show that InfoGAN learns interpretable representations that are competitive with representations learned by existing fully supervised methods.
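
For categorical codes, the variational lower bound reduces to an auxiliary cross-entropy: a recognition head Q predicts the code that was fed to the generator, and its loss is added to the generator objective. The sketch below shows that term only; the network heads and weighting are assumptions.

```python
# Sketch: InfoGAN mutual-information term for a categorical latent code (illustrative).
import torch
import torch.nn.functional as F

def mi_term(q_logits: torch.Tensor, code: torch.Tensor) -> torch.Tensor:
    """q_logits: recognition-head logits over the code, computed on generated samples;
    code: integer codes fed to the generator. Minimizing this cross-entropy maximizes
    a variational lower bound on the mutual information between code and sample."""
    return F.cross_entropy(q_logits, code)

# Generator objective (sketch): adversarial_loss + lam * mi_term(q_logits, code)
```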

NeurIPS Conference 2016 Conference Paper

VIME: Variational Information Maximizing Exploration

  • Rein Houthooft
  • Xi Chen
  • Yan Duan
  • John Schulman
  • Filip De Turck
  • Pieter Abbeel

Scalable and effective exploration remains a key challenge in reinforcement learning (RL). While there are methods with optimality guarantees in the setting of discrete state and action spaces, these methods cannot be applied in high-dimensional deep RL scenarios. As such, most contemporary RL relies on simple heuristics such as epsilon-greedy exploration or adding Gaussian noise to the controls. This paper introduces Variational Information Maximizing Exploration (VIME), an exploration strategy based on maximization of information gain about the agent's belief of environment dynamics. We propose a practical implementation, using variational inference in Bayesian neural networks which efficiently handles continuous state and action spaces. VIME modifies the MDP reward function, and can be applied with several different underlying RL algorithms. We demonstrate that VIME achieves significantly better performance compared to heuristic exploration methods across a variety of continuous control tasks and algorithms, including tasks with very sparse rewards.
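
With a factorized Gaussian posterior over dynamics-model weights, the information gain is a closed-form KL between the posterior after and before observing a transition; the sketch below adds it to the environment reward. The coefficient eta and the posterior-update step are assumptions.

```python
# Sketch: information-gain bonus for a diagonal-Gaussian weight posterior (illustrative).
import numpy as np

def gaussian_kl(mu_new, sig_new, mu_old, sig_old):
    """KL( N(mu_new, diag sig_new^2) || N(mu_old, diag sig_old^2) ), summed over dims."""
    return np.sum(np.log(sig_old / sig_new)
                  + (sig_new**2 + (mu_new - mu_old)**2) / (2.0 * sig_old**2) - 0.5)

def augmented_reward(r_env, mu_new, sig_new, mu_old, sig_old, eta=1e-3):
    # Extrinsic reward plus information gain about the belief over dynamics parameters.
    return r_env + eta * gaussian_kl(mu_new, sig_new, mu_old, sig_old)
```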

EWRL Workshop 2015 Workshop Paper

Generalized Advantage Estimation for Policy Gradients

  • John Schulman
  • Philipp Moritz
  • Sergey Levine
  • Pieter Abbeel

Value functions provide an elegant solution to the delayed reward problem in reinforcement learning, but it is difficult to accurately estimate and approximate them when the state space is high-dimensional. As a result, policy gradient methods that use Monte Carlo estimation are often preferred over methods that approximate the value function. We propose a method for using an approximate value function to help estimate the advantage function and obtain better policy gradient estimates, even when the value function is inaccurate. These estimators use a timescale parameter that makes an explicit tradeoff between bias and variance, and they empirically achieve faster policy improvement than Monte Carlo estimation and the actor-critic method, which can be viewed as limiting cases of these estimators. We present experimental results on a standard cart-pole benchmark task, as well as a number of highly challenging 3D locomotion tasks, where we show that our approach can learn complex gaits using neural network function approximators with over 10^4 parameters for both the policy and the value function.
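
The estimator itself is a discounted sum of TD residuals, A_t = sum_l (gamma*lambda)^l * delta_{t+l} with delta_t = r_t + gamma*V(s_{t+1}) - V(s_t), computed by a single backward pass; the sketch below follows that standard formulation (hyperparameter values are placeholders).

```python
# Generalized advantage estimation via backward recursion (standard formulation).
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """rewards: length-T array; values: length-(T+1) value estimates (last entry
    bootstraps the final state). Returns length-T advantage estimates."""
    T = len(rewards)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        running = delta + gamma * lam * running                 # A_t = delta_t + gamma*lam*A_{t+1}
        adv[t] = running
    return adv
```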

NeurIPS Conference 2015 Conference Paper

Gradient Estimation Using Stochastic Computation Graphs

  • John Schulman
  • Nicolas Heess
  • Theophane Weber
  • Pieter Abbeel

In a variety of problems originating in supervised, unsupervised, and reinforcement learning, the loss function is defined by an expectation over a collection of random variables, which might be part of a probabilistic model or the external world. Estimating the gradient of this loss function, using samples, lies at the core of gradient-based learning algorithms for these problems. We introduce the formalism of stochastic computation graphs--directed acyclic graphs that include both deterministic functions and conditional probability distributions--and describe how to easily and automatically derive an unbiased estimator of the loss function's gradient. The resulting algorithm for computing the gradient estimator is a simple modification of the standard backpropagation algorithm. The generic scheme we propose unifies estimators derived in a variety of prior work, along with variance-reduction techniques therein. It could assist researchers in developing intricate models involving a combination of stochastic and deterministic operations, enabling, for example, attention, memory, and control actions.
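
A toy instance of the surrogate-loss construction: for a Gaussian stochastic node with learnable mean, backpropagating through the log-probability times the stop-gradient of the downstream cost yields the score-function gradient estimate. This is a minimal single-sample example, not the paper's general graph machinery.

```python
# Toy surrogate loss for one Gaussian stochastic node (minimal single-sample example).
import torch

theta = torch.tensor([0.0, 0.0], requires_grad=True)  # mean of a unit-variance Gaussian
x = (theta + torch.randn(2)).detach()                  # sampled stochastic node
logp = -0.5 * ((x - theta) ** 2).sum()                 # log-density up to a constant
cost = (x ** 2).sum()                                  # downstream cost (depends only on x here)
surrogate = logp * cost.detach() + cost                # the "+ cost" term carries gradient only
surrogate.backward()                                   # when costs depend on theta deterministically
# theta.grad now holds an unbiased single-sample estimate of d E[cost] / d theta.
```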

ICML Conference 2015 Conference Paper

Trust Region Policy Optimization

  • John Schulman
  • Sergey Levine
  • Pieter Abbeel
  • Michael I. Jordan
  • Philipp Moritz

In this article, we describe a method for optimizing control policies, with guaranteed monotonic improvement. By making several approximations to the theoretically-justified scheme, we develop a practical algorithm, called Trust Region Policy Optimization (TRPO). This algorithm is effective for optimizing large nonlinear policies such as neural networks. Our experiments demonstrate its robust performance on a wide variety of tasks: learning simulated robotic swimming, hopping, and walking gaits; and playing Atari games using images of the screen as input. Despite its approximations that deviate from the theory, TRPO tends to give monotonic improvement, with little tuning of hyperparameters.
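
The underlying trust-region step that TRPO approximates can be written as a KL-constrained surrogate maximization (standard formulation, stated here for reference):

```latex
\max_{\theta}\; \mathbb{E}_{s,a \sim \pi_{\theta_{\mathrm{old}}}}\!
  \left[\frac{\pi_{\theta}(a \mid s)}{\pi_{\theta_{\mathrm{old}}}(a \mid s)}\,
        A^{\pi_{\theta_{\mathrm{old}}}}(s,a)\right]
\quad \text{s.t.}\quad
\mathbb{E}_{s \sim \pi_{\theta_{\mathrm{old}}}}\!
  \left[D_{\mathrm{KL}}\!\big(\pi_{\theta_{\mathrm{old}}}(\cdot \mid s)\,\|\,\pi_{\theta}(\cdot \mid s)\big)\right]
\le \delta
```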

ICRA Conference 2014 Conference Paper

Gaussian belief space planning with discontinuities in sensing domains

  • Sachin Patil
  • Yan Duan
  • John Schulman
  • Ken Goldberg
  • Pieter Abbeel

Discontinuities in sensing domains are common when planning for many robotic navigation and manipulation tasks. For cameras and 3D sensors, discontinuities may be inherent in sensor field of view or may change over time due to occlusions that are created by moving obstructions and movements of the sensor. The associated gaps in sensor information due to missing measurements pose a challenge for belief space and related optimization-based planning methods since there is no gradient information when the system state is outside the sensing domain. We address this in a belief space context by considering the signed distance to the sensing region. We smooth out sensing discontinuities by assuming that measurements can be obtained outside the sensing region with noise levels depending on a sigmoid function of the signed distance. We sequentially improve the continuous approximation by increasing the sigmoid slope over an outer loop to find plans that cope with sensor discontinuities. We also incorporate the information contained in not obtaining a measurement about the state during execution by appropriately truncating the Gaussian belief state. We present results in simulation for tasks with uncertainty involving navigation of mobile robots and reaching tasks with planar robot arms. Experiments suggest that the approach can be used to cope with discontinuities in sensing domains by effectively re-planning during execution.
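
The smoothing can be sketched as measurement noise that is inflated by a sigmoid of the signed distance to the sensing region, with the slope raised over the outer loop; the particular noise model below is an illustration, not the paper's exact formulation.

```python
# Sketch: sigmoid-smoothed measurement noise as a function of signed distance
# (illustrative noise model; signed-distance computation not shown).
import numpy as np

def measurement_noise_std(signed_dist, base_std, alpha):
    """signed_dist < 0 inside the sensing region, > 0 outside. As alpha grows over the
    outer loop, noise approaches base_std inside and becomes effectively infinite outside."""
    p_observed = 1.0 / (1.0 + np.exp(alpha * signed_dist))  # ~1 inside, ~0 outside
    return base_std / max(p_observed, 1e-9)                 # inflated noise outside the region
```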

ICRA Conference 2014 Conference Paper

Planning locally optimal, curvature-constrained trajectories in 3D using sequential convex optimization

  • Yan Duan
  • Sachin Patil
  • John Schulman
  • Ken Goldberg
  • Pieter Abbeel

3D curvature-constrained motion planning finds applications in a wide variety of domains, including motion planning for flexible, bevel-tip medical needles, planning curvature-constrained channels in 3D printed implants for targeted brachytherapy dose delivery or channels for cooling turbine blades, and path planning for unmanned aerial vehicles (UAVs). In this work, we present a motion planning technique using sequential convex optimization for computing locally optimal, curvature-constrained trajectories to desired targets while avoiding obstacles in 3D environments. We report two main contributions in this work: (i) curvature-constrained trajectory optimization in 6D pose (position and orientation) space, and (ii) planning multiple trajectories that are mutually collision-free. We demonstrate the performance of our approach on two clinically motivated applications. Our experiments indicate that our approach can compute high-quality plans for medical needle steering in 1.6 seconds on a commodity PC, enabling re-planning during execution to correct for perturbations. Our approach can also be used for designing optimized channel layouts within 3D printed implants for intracavitary brachytherapy.

IROS Conference 2013 Conference Paper

A case study of trajectory transfer through non-rigid registration for a simplified suturing scenario

  • John Schulman
  • Ankush Gupta
  • Sibi Venkatesan
  • Mallory Tayson-Frederick
  • Pieter Abbeel

Suturing is an important yet time-consuming part of surgery. A fast and robust autonomous procedure could reduce surgeon fatigue, and shorten operation times. It could also be of particular importance for suturing in remote tele-surgery settings where latency can complicate the master-slave mode control that is the current practice for robotic surgery with systems like the da Vinci®. We study the applicability of the trajectory transfer algorithm proposed in [12] to the automation of suturing. The core idea of this procedure is to first use non-rigid registration to find a 3D warping function which maps the demonstration scene onto the test scene, then use this warping function to transform the robot end-effector trajectory. Finally a robot joint trajectory is generated by solving a trajectory optimization problem that attempts to find the closest feasible trajectory, accounting for external constraints, such as joint limits and obstacles. Our experiments investigate generalization from a single demonstration to differing initial conditions. A first set of experiments considers the problem of having a simulated Raven II system [5] suture two flaps of tissue together. A second set of experiments considers a PR2 robot performing sutures in a scaled-up experimental setup. The simulation experiments were fully autonomous. For the real-world experiments we provided human input to assist with the detection of landmarks to be fed into the registration algorithm. The success rate for learning from a single demonstration is high for moderate perturbations from the demonstration's initial conditions, and it gradually decreases for larger perturbations.
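
The warping step of the transfer procedure can be sketched with an off-the-shelf thin-plate-spline interpolant fitted between demonstration and test landmarks and then applied to the demonstrated end-effector path; the paper additionally solves a trajectory optimization for a feasible joint trajectory, which is not shown, and this SciPy-based stand-in is an assumption rather than the paper's registration code.

```python
# Sketch: warp a demonstrated end-effector path using a registration fit between
# demo and test landmarks (SciPy stand-in for the paper's registration step).
import numpy as np
from scipy.interpolate import RBFInterpolator

def warp_trajectory(demo_landmarks, test_landmarks, demo_traj):
    """demo_landmarks, test_landmarks: (M, 3) corresponding points; demo_traj: (T, 3)."""
    f = RBFInterpolator(demo_landmarks, test_landmarks, kernel="thin_plate_spline")
    return f(demo_traj)  # warped end-effector positions in the test scene
```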

IROS Conference 2013 Conference Paper

Sigma hulls for Gaussian belief space planning for imprecise articulated robots amid obstacles

  • Alex X. Lee
  • Yan Duan
  • Sachin Patil
  • John Schulman
  • Zoe McCarthy
  • Jur van den Berg
  • Ken Goldberg
  • Pieter Abbeel

In many home and service applications, an emerging class of articulated robots such as the Raven and Baxter trade off precision in actuation and sensing to reduce costs and to reduce the potential for injury to humans in their workspaces. For planning and control of such robots, planning in belief space, i.e., modeling such problems as POMDPs, has shown great promise but existing belief space planning methods have primarily been applied to cases where robots can be approximated as points or spheres. In this paper, we extend the belief space framework to treat articulated robots where the linkage can be decomposed into convex components. To allow planning and collision avoidance in Gaussian belief spaces, we introduce the concept of sigma hulls: convex hulls of robot links transformed according to the sigma standard deviation boundary points generated by the Unscented Kalman filter (UKF). We characterize the signed distances between sigma hulls and obstacles in the workspace to formulate efficient collision avoidance constraints compatible with the Gilbert-Johnson-Keerthi (GJK) and Expanding Polytope Algorithm (EPA) within an optimization-based planning framework. We report results in simulation for planning motions for a 4-DOF planar robot and a 7-DOF articulated robot with imprecise actuation and inaccurate sensors. These experiments suggest that the sigma hull framework can significantly reduce the probability of collision and is computationally efficient enough to permit iterative re-planning for model predictive control.
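
One way to picture a sigma hull is as the convex hull of a link's vertices placed at the poses induced by the filter's sigma points; the sketch below constructs that hull (the sigma-point generation and the signed-distance collision constraints are not shown, and the interfaces are assumptions).

```python
# Sketch: vertices of a sigma hull for one link (illustrative interfaces).
import numpy as np
from scipy.spatial import ConvexHull

def sigma_hull_vertices(link_vertices, fk, sigma_points):
    """link_vertices: (V, 3) in the link frame; fk(q) -> (4, 4) link pose for joint vector q;
    sigma_points: iterable of joint vectors from the UKF. Returns hull vertex coordinates."""
    pts = []
    for q in sigma_points:
        T = fk(q)
        pts.append(link_vertices @ T[:3, :3].T + T[:3, 3])
    pts = np.vstack(pts)
    return pts[ConvexHull(pts).vertices]
```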

ICRA Conference 2013 Conference Paper

Tracking deformable objects with point clouds

  • John Schulman
  • Alex X. Lee
  • Jonathan Ho
  • Pieter Abbeel

We introduce an algorithm for tracking deformable objects from a sequence of point clouds. The proposed tracking algorithm is based on a probabilistic generative model that incorporates observations of the point cloud and the physical properties of the tracked object and its environment. We propose a modified expectation maximization algorithm to perform maximum a posteriori estimation to update the state estimate at each time step. Our modification makes it practical to perform the inference through calls to a physics simulation engine. This is significant because (i) it allows for the use of highly optimized physics simulation engines for the core computations of our tracking algorithm, and (ii) it makes it possible to naturally, and efficiently, account for physical constraints imposed by collisions, grasping actions, and material properties in the observation updates. Even in the presence of the relatively large occlusions that occur during manipulation tasks, our algorithm is able to robustly track a variety of types of deformable objects, including ones that are one-dimensional, such as ropes; two-dimensional, such as cloth; and three-dimensional, such as sponges. Our implementation can track these objects in real time.
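
The E-step of such a tracker can be sketched as soft Gaussian correspondences between tracked nodes and observed points, yielding expected target positions that the physics engine then pulls the model toward in the M-step; the normalization and noise scale below are illustrative assumptions.

```python
# Sketch: E-step soft correspondences for deformable-object tracking (illustrative;
# the M-step through a physics engine is not shown).
import numpy as np

def soft_correspondences(nodes, points, sigma=0.01):
    """nodes: (K, 3) tracked object nodes; points: (N, 3) observed point cloud.
    Returns (K, 3) expected target positions under Gaussian responsibilities."""
    d2 = ((nodes[:, None, :] - points[None, :, :]) ** 2).sum(-1)  # (K, N) squared distances
    resp = np.exp(-d2 / (2.0 * sigma ** 2))
    resp /= resp.sum(axis=1, keepdims=True) + 1e-12               # normalize over points
    return resp @ points                                          # expected target per node
```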