Author name cluster

Yuri Burda

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

4 papers

2 author rows

ICLR Conference 2024 Conference Paper

Let's Verify Step by Step

Hunter Lightman
Vineet Kosaraju
Yuri Burda
Harrison Edwards
Bowen Baker
Teddy Lee
Jan Leike
John Schulman

In recent years, large language models have greatly improved in their ability to perform complex multi-step reasoning. However, even state-of-the-art models still regularly produce logical mistakes. To train more reliable models, we can turn either to outcome supervision, which provides feedback for a final result, or process supervision, which provides feedback for each intermediate reasoning step. Given the importance of training reliable models, and given the high cost of human feedback, it is important to carefully compare the both methods. Recent work has already begun this comparison, but many questions still remain. We conduct our own investigation, finding that process supervision significantly outperforms outcome supervision for training models to solve problems from the challenging MATH dataset. Our process-supervised model solves 78% of problems from a representative subset of the MATH test set. Additionally, we show that active learning significantly improves the efficacy of process supervision. To support related research, we also release PRM800K, the complete dataset of 800,000 step-level human feedback labels used to train our best reward model.

Details

ICLR Conference 2018 Conference Paper

Continuous Adaptation via Meta-Learning in Nonstationary and Competitive Environments

Maruan Al-Shedivat
Trapit Bansal
Yuri Burda
Ilya Sutskever
Igor Mordatch
Pieter Abbeel

Ability to continuously learn and adapt from limited experience in nonstationary environments is an important milestone on the path towards general intelligence. In this paper, we cast the problem of continuous adaptation into the learning-to-learn framework. We develop a simple gradient-based meta-learning algorithm suitable for adaptation in dynamically changing and adversarial scenarios. Additionally, we design a new multi-agent competitive environment, RoboSumo, and define iterated adaptation games for testing various aspects of continuous adaptation. We demonstrate that meta-learning enables significantly more efficient adaptation than reactive baselines in the few-shot regime. Our experiments with a population of agents that learn and compete suggest that meta-learners are the fittest.

Details

AAMAS Conference 2018 Conference Paper

Evaluating Generalization in Multiagent Systems using Agent-Interaction Graphs

Aditya Grover
Maruan Al-Shedivat
Jayesh K. Gupta
Yuri Burda
Harrison Edwards

Learning from interactions between agents is a key component for inference in multiagent systems. Depending on the downstream task, there could be multiple criteria for evaluating the generalization performance of learning. In this work, we propose a novel framework for evaluating generalization in multiagent systems based on agent-interaction graphs. An agent-interaction graph models agents as nodes and interactions as hyper-edges between participating agents. Using this abstract data structure, we define three notions of generalization for principled evaluation of learning in multiagent systems.

PDF

ICML Conference 2018 Conference Paper

Learning Policy Representations in Multiagent Systems

Aditya Grover
Maruan Al-Shedivat
Jayesh K. Gupta
Yuri Burda
Harrison Edwards

Modeling agent behavior is central to understanding the emergence of complex phenomena in multiagent systems. Prior work in agent modeling has largely been task-specific and driven by hand-engineering domain-specific prior knowledge. We propose a general learning framework for modeling agent behavior in any multiagent system using only a handful of interaction data. Our framework casts agent modeling as a representation learning problem. Consequently, we construct a novel objective inspired by imitation learning and agent identification and design an algorithm for unsupervised learning of representations of agent policies. We demonstrate empirically the utility of the proposed framework in (i) a challenging high-dimensional competitive environment for continuous control and (ii) a cooperative environment for communication, on supervised predictive tasks, unsupervised clustering, and policy optimization using deep reinforcement learning.

Details