Author name cluster

Robert Dadashi

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

16 papers

2 author rows

ICLR Conference 2025 Conference Paper

BOND: Aligning LLMs with Best-of-N Distillation

Pier Giuseppe Sessa
Robert Dadashi
Léonard Hussenot
Johan Ferret
Nino Vieillard
Alexandre Ramé
Bobak Shahriari
Sarah Perrin

Reinforcement learning from human feedback (RLHF) is a key driver of quality and safety in state-of-the-art large language models. Yet, a surprisingly simple and strong inference-time strategy is Best-of-N sampling that selects the best generation among N candidates. In this paper, we propose Best-of-N Distillation (BOND), a novel RLHF algorithm that seeks to emulate Best-of-N but without its significant computational overhead at inference time. Specifically, BOND is a distribution matching algorithm that forces the distribution of generations from the policy to get closer to the Best-of-N distribution. We use the Jeffreys divergence (a linear combination of forward and backward KL) to balance between mode-covering and mode-seeking behavior, and derive an iterative formulation that utilizes a moving anchor for efficiency. We demonstrate the effectiveness of our approach and several design choices through experiments on abstractive summarization and Gemma models.


ICML Conference 2024 Conference Paper

WARM: On the Benefits of Weight Averaged Reward Models

Alexandre Ramé
Nino Vieillard
Léonard Hussenot
Robert Dadashi
Geoffrey Cideron
Olivier Bachem
Johan Ferret

Aligning large language models (LLMs) with human preferences through reinforcement learning (RLHF) can lead to reward hacking, where LLMs exploit failures in the reward model (RM) to achieve seemingly high rewards without meeting the underlying objectives. We identify two primary challenges when designing RMs to mitigate reward hacking: distribution shifts during the RL process and inconsistencies in human preferences. As a solution, we propose Weight Averaged Reward Models (WARM), first fine-tuning multiple RMs, then averaging them in the weight space. This strategy follows the observation that fine-tuned weights remain linearly mode connected when sharing the same pre-training. By averaging weights, WARM improves efficiency compared to the traditional ensembling of predictions, while improving reliability under distribution shifts and robustness to preference inconsistencies. Our experiments on summarization tasks, using best-of-N and RL methods, show that WARM improves the overall quality and alignment of LLM predictions; for example, a policy RL fine-tuned with WARM has a 79.4% win rate against a policy RL fine-tuned with a single RM.


ICML Conference 2022 Conference Paper

Continuous Control with Action Quantization from Demonstrations

Robert Dadashi
Léonard Hussenot
Damien Vincent
Sertan Girgin
Anton Raichuk
Matthieu Geist
Olivier Pietquin

In this paper, we propose a novel Reinforcement Learning (RL) framework for problems with continuous action spaces: Action Quantization from Demonstrations (AQuaDem). The proposed approach consists in learning a discretization of continuous action spaces from human demonstrations. This discretization returns a set of plausible actions (in light of the demonstrations) for each input state, thus capturing the priors of the demonstrator and their multimodal behavior. By discretizing the action space, any discrete action deep RL technique can be readily applied to the continuous control problem. Experiments show that the proposed approach outperforms state-of-the-art methods such as SAC in the RL setup, and GAIL in the Imitation Learning setup. We provide a website with interactive videos: https://google-research.github.io/aquadem/ and make the code available: https://github.com/google-research/google-research/tree/master/aquadem.


EWRL Workshop 2022 Workshop Paper

Continuous Control with Action Quantization from Demonstrations

Robert Dadashi
Léonard Hussenot
Damien Vincent
Sertan Girgin
Anton Raichuk
Matthieu Geist
Olivier Pietquin


EWRL Workshop 2022 Workshop Paper

Get Back Here: Robust Imitation by Return-to-Distribution Planning

Geoffrey Cideron
Olivier Pietquin
Robert Dadashi
Gabriel Dulac-Arnold
Baruch Tabanpour
Matthieu Geist
Léonard Hussenot
Sebastian Curi

We consider the Imitation Learning (IL) setup where expert data are not collected on the actual deployment environment but on a different version. To address the resulting distribution shift, we combine behavior cloning (BC) with a planner that is tasked to bring the agent back to states visited by the expert whenever the agent deviates from the demonstration distribution. The resulting algorithm, POIR, can be trained offline, and leverages online interactions to fine-tune its planner to improve performance over time. We test POIR on a variety of human-generated manipulation demonstrations and show robustness of the learned policy to different initial state distributions and different dynamics.


NeurIPS Conference 2022 Conference Paper

Learning Energy Networks with Generalized Fenchel-Young Losses

Mathieu Blondel
Felipe Llinares-Lopez
Robert Dadashi
Leonard Hussenot
Matthieu Geist

Energy-based models, a.k.a. energy networks, perform inference by optimizing an energy function, typically parametrized by a neural network. This allows one to capture potentially complex relationships between inputs and outputs. To learn the parameters of the energy function, the solution to that optimization problem is typically fed into a loss function. The key challenge for training energy networks lies in computing loss gradients, as this typically requires argmin/argmax differentiation. In this paper, building upon a generalized notion of conjugate function, which replaces the usual bilinear pairing with a general energy function, we propose generalized Fenchel-Young losses, a natural loss construction for learning energy networks. Our losses enjoy many desirable properties and their gradients can be computed efficiently without argmin/argmax differentiation. We also prove the calibration of their excess risk in the case of linear-concave energies. We demonstrate our losses on multilabel classification and imitation learning tasks.


AAAI Conference 2022 Conference Paper

Offline Reinforcement Learning as Anti-exploration

Shideh Rezaeifar
Robert Dadashi
Nino Vieillard
Léonard Hussenot
Olivier Bachem
Olivier Pietquin
Matthieu Geist

Offline Reinforcement Learning (RL) aims at learning an optimal control from a fixed dataset, without interactions with the system. An agent in this setting should avoid selecting actions whose consequences cannot be predicted from the data. This is the converse of exploration in RL, which favors such actions. We thus take inspiration from the literature on bonus-based exploration to design a new offline RL agent. The core idea is to subtract a prediction-based exploration bonus from the reward, instead of adding it for exploration. This allows the policy to stay close to the support of the dataset, and practically extends some previous pessimism-based offline RL methods to a deep learning setting with arbitrary bonuses. We also connect this approach to a more common regularization of the learned policy towards the data. Instantiated with a bonus based on the prediction error of a variational autoencoder, we show that our simple agent is competitive with the state of the art on a set of continuous control locomotion and manipulation tasks.


ICML Conference 2021 Conference Paper

Hyperparameter Selection for Imitation Learning

Léonard Hussenot
Marcin Andrychowicz
Damien Vincent
Robert Dadashi
Anton Raichuk
Sabela Ramos
Nikola Momchev
Sertan Girgin

We address the issue of tuning hyperparameters (HPs) for imitation learning algorithms in the context of continuous-control, when the underlying reward function of the demonstrating expert cannot be observed at any time. The vast literature in imitation learning mostly considers this reward function to be available for HP selection, but this is not a realistic setting. Indeed, would this reward function be available, it could then directly be used for policy training and imitation would not be necessary. To tackle this mostly ignored problem, we propose a number of possible proxies to the external reward. We evaluate them in an extensive empirical study (more than 10’000 agents across 9 environments) and make practical recommendations for selecting HPs. Our results show that while imitation learning algorithms are sensitive to HP choices, it is often possible to select good enough HPs through a proxy to the reward function.


ICML Conference 2021 Conference Paper

Offline Reinforcement Learning with Pseudometric Learning

Robert Dadashi
Shideh Rezaeifar
Nino Vieillard
Léonard Hussenot
Olivier Pietquin
Matthieu Geist

Offline Reinforcement Learning methods seek to learn a policy from logged transitions of an environment, without any interaction. In the presence of function approximation, and under the assumption of limited coverage of the state-action space of the environment, it is necessary to enforce the policy to visit state-action pairs close to the support of logged transitions. In this work, we propose an iterative procedure to learn a pseudometric (closely related to bisimulation metrics) from logged transitions, and use it to define this notion of closeness. We show its convergence and extend it to the function approximation setting. We then use this pseudometric to define a new lookup based bonus in an actor-critic algorithm: PLOFF. This bonus encourages the actor to stay close, in terms of the defined pseudometric, to the support of logged transitions. Finally, we evaluate the method on hand manipulation and locomotion tasks.


ICLR Conference 2021 Conference Paper

Primal Wasserstein Imitation Learning

Robert Dadashi
Léonard Hussenot
Matthieu Geist
Olivier Pietquin

Imitation Learning (IL) methods seek to match the behavior of an agent with that of an expert. In the present work, we propose a new IL method based on a conceptually simple algorithm: Primal Wasserstein Imitation Learning (PWIL), which ties to the primal form of the Wasserstein distance between the expert and the agent state-action distributions. We present a reward function which is derived offline, as opposed to recent adversarial IL algorithms that learn a reward function through interactions with the environment, and which requires little fine-tuning. We show that we can recover expert behavior on a variety of continuous control tasks of the MuJoCo domain in a sample efficient manner in terms of agent interactions and of expert interactions with the environment. Finally, we show that the behavior of the agent we train matches the behavior of the expert with the Wasserstein distance, rather than the commonly used proxy of performance.


AAMAS Conference 2021 Conference Paper

Show Me the Way: Intrinsic Motivation from Demonstrations

Léonard Hussenot
Robert Dadashi
Matthieu Geist
Olivier Pietquin

The study of exploration in the domain of decision making has a long history but remains actively debated. From the vast literature that addressed this topic for decades under various points of view (e.g., developmental psychology, experimental design, artificial intelligence), intrinsic motivation emerged as a concept that can practically be transferred to artificial agents. Especially, in the recent field of Deep Reinforcement Learning (RL), agents implement such a concept (mainly using a novelty argument) in the shape of an exploration bonus, added to the task reward, that encourages visiting the whole environment. This approach is supported by the large amount of theory on RL for which convergence to optimality assumes exhaustive exploration. Yet, Human Beings and mammals do not exhaustively explore the world and their motivation is not only based on novelty but also on various other factors (e.g., curiosity, fun, style, pleasure, safety, competition, etc.). They optimize for life-long learning and train to learn transferable skills in playgrounds without obvious goals. They also apply innate or learned priors to save time and stay safe. For these reasons, we propose to learn an exploration bonus from demonstrations that could transfer these motivations to an artificial agent with little assumptions about their rationale. Using an inverse RL approach, we show that complex exploration behaviors, reflecting different motivations, can be learnt and efficiently used by RL agents to solve tasks for which exhaustive exploration is prohibitive.


AAAI Conference 2021 Conference Paper

The Value-Improvement Path: Towards Better Representations for Reinforcement Learning

Will Dabney
André Barreto
Mark Rowland
Robert Dadashi
John Quan
Marc G. Bellemare
David Silver

In value-based reinforcement learning (RL), unlike in supervised learning, the agent faces not a single, stationary, approximation problem, but a sequence of value prediction problems. Each time the policy improves, the nature of the problem changes, shifting both the distribution of states and their values. In this paper we take a novel perspective, arguing that the value prediction problems faced by an RL agent should not be addressed in isolation, but rather as a single, holistic, prediction problem. An RL algorithm generates a sequence of policies that, approximately, improve towards the optimal policy. We explicitly characterize the associated sequence of value functions and call it the value-improvement path. Our main idea is to approximate the value-improvement path holistically, rather than to solely track the value function of the current policy. Specifically, we discuss the impact that this holistic view of RL has on representation learning. We demonstrate that a representation that spans the past value-improvement path will also provide an accurate value approximation for future policy improvements. We use this insight to better understand existing approaches to auxiliary tasks and to propose new ones. To test our hypothesis empirically, we augmented a standard deep RL agent with an auxiliary task of learning the value-improvement path. In a study of Atari 2600 games, the augmented agent achieved approximately double the mean and median performance of the baseline agent.


NeurIPS Conference 2021 Conference Paper

What Matters for Adversarial Imitation Learning?

Manu Orsini
Anton Raichuk
Leonard Hussenot
Damien Vincent
Robert Dadashi
Sertan Girgin
Matthieu Geist
Olivier Bachem

Adversarial imitation learning has become a popular framework for imitation in continuous control. Over the years, several variations of its components were proposed to enhance the performance of the learned policies as well as the sample complexity of the algorithm. In practice, these choices are rarely tested all together in rigorous empirical studies. It is therefore difficult to discuss and understand what choices, among the high-level algorithmic options as well as low-level implementation details, matter. To tackle this issue, we implement more than 50 of these choices in a generic adversarial imitation learning framework and investigate their impacts in a large-scale study (>500k trained agents) with both synthetic and human-generated demonstrations. We analyze the key results and highlight the most surprising findings.


NeurIPS Conference 2019 Conference Paper

A Geometric Perspective on Optimal Representations for Reinforcement Learning

Marc Bellemare
Will Dabney
Robert Dadashi
Adrien Ali Taiga
Pablo Samuel Castro
Nicolas Le Roux
Dale Schuurmans
Tor Lattimore

We propose a new perspective on representation learning in reinforcement learning based on geometric properties of the space of value functions. From there, we provide formal evidence regarding the usefulness of value functions as auxiliary tasks in reinforcement learning. Our formulation considers adapting the representation to minimize the (linear) approximation of the value function of all stationary policies for a given environment. We show that this optimization reduces to making accurate predictions regarding a special class of value functions which we call adversarial value functions (AVFs). We demonstrate that using value functions as auxiliary tasks corresponds to an expected-error relaxation of our formulation, with AVFs a natural candidate, and identify a close relationship with proto-value functions (Mahadevan, 2005). We highlight characteristics of AVFs and their usefulness as auxiliary tasks in a series of experiments on the four-room domain.


ICML Conference 2019 Conference Paper

Statistics and Samples in Distributional Reinforcement Learning

Mark Rowland
Robert Dadashi
Saurabh Kumar
Rémi Munos
Marc G. Bellemare
Will Dabney

We present a unifying framework for designing and analysing distributional reinforcement learning (DRL) algorithms in terms of recursively estimating statistics of the return distribution. Our key insight is that DRL algorithms can be decomposed as the combination of some statistical estimator and a method for imputing a return distribution consistent with that set of statistics. With this new understanding, we are able to provide improved analyses of existing DRL algorithms as well as construct a new algorithm (EDRL) based upon estimation of the expectiles of the return distribution. We compare EDRL with existing methods on a variety of MDPs to illustrate concrete aspects of our analysis, and develop a deep RL variant of the algorithm, ER-DQN, which we evaluate on the Atari-57 suite of games.


ICML Conference 2019 Conference Paper

The Value Function Polytope in Reinforcement Learning

Robert Dadashi
Marc G. Bellemare
Adrien Ali Taïga
Nicolas Le Roux
Dale Schuurmans

We establish geometric and topological properties of the space of value functions in finite state-action Markov decision processes. Our main contribution is the characterization of the nature of its shape: a general polytope (Aigner et al., 2010). To demonstrate this result, we exhibit several properties of the structural relationship between policies and value functions including the line theorem, which shows that the value functions of policies constrained on all but one state describe a line segment. Finally, we use this novel perspective and introduce visualizations to enhance the understanding of the dynamics of reinforcement learning algorithms.
