Arrow Research search

Author name cluster

Alexandre Piché

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

6 papers
2 author rows

Possible papers

TMLR Journal 2026 Journal Article

PipelineRL: Faster On-policy Reinforcement Learning for Long Sequence Generation

  • Alexandre Piché
  • Ehsan Kamalloo
  • Rafael Pardinas
  • Xiaoyin Chen
  • Dzmitry Bahdanau

Reinforcement Learning (RL) is increasingly utilized to enhance the reasoning capabilities of Large Language Models (LLMs). However, effectively scaling these RL methods presents significant challenges, primarily due to the difficulty in maintaining high AI accelerator utilization without generating stale, off-policy data that harms common RL algorithms. This paper introduces PipelineRL, an approach designed to achieve a superior trade-off between hardware efficiency and data on-policyness for LLM training. PipelineRL employs concurrent asynchronous data generation and model training, distinguished by novel in-flight weight updates. This mechanism allows the LLM generation engine to receive updated model weights with minimal interruption during the generation of token sequences, thereby maximizing both the accelerator utilization and the freshness of training data. Experiments conducted on long-form reasoning tasks using 128 H100 GPUs demonstrate that PipelineRL achieves approximately $2\times$ faster learning compared to conventional RL baselines while maintaining highly on-policy training data. A scalable and modular open-source implementation of PipelineRL is also released as a key contribution.

TMLR Journal 2025 Journal Article

LLMs can learn self-restraint through iterative self-reflection

  • Alexandre Piché
  • Aristides Milios
  • Dzmitry Bahdanau
  • Christopher Pal

In order to be deployed safely, Large Language Models (LLMs) must be capable of dynamically adapting their behavior based on their level of knowledge and uncertainty associated with specific topics. This adaptive behavior, which we refer to as self-restraint, is non-trivial to teach since it depends on the internal knowledge of an LLM. By default, LLMs are trained to maximize the next token likelihood, which does not teach the model to modulate its answer based on its level of uncertainty. In order to learn self-restraint, we devise a utility function that can encourage the model to produce responses only when its level of confidence is above a user-specified target accuracy $\rho^*$. This utility function can be used to score generations of different lengths as well as abstention. To optimize this function, we introduce ReSearch, a process of "self-reflection" consisting of iterative self-prompting and self-evaluation. We use the ReSearch algorithm to generate synthetic data on which we finetune our models. ReSearch elegantly incorporates the ability to abstain by augmenting the samples generated by the model during the search procedure with an answer expressing abstention. Compared to their original versions, our resulting models generate fewer hallucinations overall at no additional inference cost, for both known and unknown topics, as the model learns to selectively restrain itself. In addition, we show that our iterative search is more efficient as a function of tokens than naive search. Finally, we show that by modifying the target accuracy $\rho^*$, our trained models exhibit different behaviors.

TMLR Journal 2024 Journal Article

Exploring validation metrics for offline model-based optimisation with diffusion models

  • Christopher Beckham
  • Alexandre Piché
  • David Vazquez
  • Christopher Pal

In model-based optimisation (MBO) we are interested in using machine learning to design candidates that maximise some measure of reward with respect to a black box function called the (ground truth) oracle, which is expensive to compute since it involves executing a real world process. In offline MBO we wish to do so without assuming access to such an oracle during training or validation, which makes evaluation non-straightforward. While an approximation to the ground truth oracle can be trained and used in place of it during model validation to measure the mean reward over generated candidates, the evaluation is approximate and vulnerable to adversarial examples. Measuring the mean reward of generated candidates over this approximation is one such "validation metric", whereas we are interested in a more fundamental question: which validation metrics correlate the most with the ground truth? Answering it involves proposing validation metrics and quantifying them over many datasets for which the ground truth is known, for instance simulated environments. This is encapsulated under our proposed evaluation framework, which is also designed to measure extrapolation, the ultimate goal behind leveraging generative models for MBO. While our evaluation framework is model agnostic, we specifically evaluate denoising diffusion models due to their state-of-the-art performance, and we derive insights such as a ranking of the most effective validation metrics and a discussion of important hyperparameters.

TMLR Journal 2023 Journal Article

Bridging the Gap Between Target Networks and Functional Regularization

  • Alexandre Piché
  • Valentin Thomas
  • Joseph Marino
  • Rafael Pardinas
  • Gian Maria Marconi
  • Christopher Pal
  • Mohammad Emtiyaz Khan

Bootstrapping is behind much of the success of deep Reinforcement Learning. However, learning the value function via bootstrapping often leads to unstable training due to fast-changing target values. Target Networks are employed to stabilize training by using an additional set of lagging parameters to estimate the target values. Despite the popularity of Target Networks, their effect on the optimization is still misunderstood. In this work, we show that they act as an implicit regularizer which can be beneficial in some cases, but they also have disadvantages: they are inflexible and can result in instabilities, even when vanilla TD(0) converges. To overcome these issues, we propose an explicit Functional Regularization alternative that is flexible and a convex regularizer in function space, and we theoretically study its convergence. We conducted an experimental study across a range of environments, discount factors, and data collections of varying off-policyness to investigate the effectiveness of the regularization induced by Target Networks and Functional Regularization in terms of performance, accuracy, and stability. Our findings emphasize that Functional Regularization can be used as a drop-in replacement for Target Networks and results in performance improvement. Furthermore, adjusting both the regularization weight and the network update period in Functional Regularization can result in further performance improvements compared to solely adjusting the network update period as typically done with Target Networks. Our approach also enhances the ability of networks to recover accurate $Q$-values.

ICML Conference 2023 Conference Paper

Mastering the Unsupervised Reinforcement Learning Benchmark from Pixels

  • Sai Rajeswar
  • Pietro Mazzaglia
  • Tim Verbelen
  • Alexandre Piché
  • Bart Dhoedt
  • Aaron C. Courville
  • Alexandre Lacoste

Controlling artificial agents from visual sensory data is an arduous task. Reinforcement learning (RL) algorithms can succeed but require large amounts of interactions between the agent and the environment. To alleviate the issue, unsupervised RL proposes to employ self-supervised interaction and learning, for adapting faster to future tasks. Yet, as shown in the Unsupervised RL Benchmark (URLB; Laskin et al. 2021), whether current unsupervised strategies can improve generalization capabilities is still unclear, especially in visual control settings. In this work, we study the URLB and propose a new method to solve it, using unsupervised model-based RL for pre-training the agent, and a task-aware fine-tuning strategy combined with a newly proposed hybrid planner, Dyna-MPC, to adapt the agent for downstream tasks. On URLB, our method obtains 93.59% overall normalized performance, surpassing previous baselines by a staggering margin. The approach is empirically evaluated through a large-scale empirical study, which we use to validate our design choices and analyze our models. We also show robust performance on the Real-World RL benchmark, hinting at resiliency to environment perturbations during adaptation. Project website: https://masteringurlb.github.io/

UAI Conference 2018 Conference Paper

Adaptive Stochastic Dual Coordinate Ascent for Conditional Random Fields

  • Rémi Le Priol
  • Alexandre Piché
  • Simon Lacoste-Julien

This work investigates the training of conditional random fields (CRFs) via the stochastic dual coordinate ascent (SDCA) algorithm of Shalev-Shwartz and Zhang (2016). SDCA enjoys a linear convergence rate and strong empirical performance for binary classification problems. However, it has never been used to train CRFs. Yet it benefits from an “exact” line search with a single marginalization oracle call, unlike previous approaches. In this paper, we adapt SDCA to train CRFs, and we enhance it with an adaptive non-uniform sampling strategy based on block duality gaps. We perform experiments on four standard sequence prediction tasks. SDCA demonstrates performance on par with the state of the art, and improves over it on three of the four datasets, which have in common the use of sparse features.