Author name cluster

Cong Lu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

19 papers
2 author rows

Possible papers

ICLR Conference 2025 Conference Paper

Automated Design of Agentic Systems

  • Shengran Hu
  • Cong Lu
  • Jeff Clune

Researchers are investing substantial effort in developing powerful general-purpose agents, wherein Foundation Models are used as modules within agentic systems (e.g. Chain-of-Thought, Self-Reflection, Toolformer). However, the history of machine learning teaches us that hand-designed solutions are eventually replaced by learned solutions. We describe a newly forming research area, Automated Design of Agentic Systems (ADAS), which aims to automatically create powerful agentic system designs, including inventing novel building blocks and/or combining them in new ways. We further demonstrate that there is an unexplored yet promising approach within ADAS where agents can be defined in code and new agents can be automatically discovered by a meta agent programming ever better ones in code. Given that programming languages are Turing Complete, this approach theoretically enables the learning of any possible agentic system: including novel prompts, tool use, workflows, and combinations thereof. We present a simple yet effective algorithm named Meta Agent Search to demonstrate this idea, where a meta agent iteratively programs interesting new agents based on an ever-growing archive of previous discoveries. Through extensive experiments across multiple domains including coding, science, and math, we show that our algorithm can progressively invent agents with novel designs that greatly outperform state-of-the-art hand-designed agents. Importantly, we consistently observe the surprising result that agents invented by Meta Agent Search maintain superior performance even when transferred across domains and models, demonstrating their robustness and generality. Provided we develop it safely, our work illustrates the potential of an exciting new research direction toward automatically designing ever-more powerful agentic systems to benefit humanity.

EAAI Journal 2025 Journal Article

Ergonomic conscious scheduling of maintenance activities in marine vehicles using an optimized non-dominated sorting genetic algorithm-II – An application of job-shop scheduling

  • Shaban Usman
  • Cong Lu

Incorporating ergonomics in marine activities is critical due to the extreme working conditions and limited crew in marine vehicles, aiming to enhance productivity and job performance by reducing the risks of fatigue, stress, and work-related musculoskeletal disorders. This paper introduces an innovative analogy of the flexible job-shop scheduling problem with ergonomic considerations (AFJSP-ER) to schedule maintenance activities in marine systems, addressing the dual objectives of optimizing productivity and promoting ergonomic relief. A novel metric, ‘ergonomic impact load’, is introduced to assess the actual workload of the crew by combining the processing time and the rapid entire body assessment (REBA) score of an operation. To solve the AFJSP-ER, an optimized non-dominated sorting genetic algorithm-II (ONSGA) is proposed, incorporating an optimized random crossover (ORX) operator. The ORX operator is fine-tuned using the Taguchi method to determine the optimal number of elements for crossover, while non-dominated sorting ensures the selection of superior individuals after crossover and mutation. The effectiveness of the proposed ONSGA has been validated through extensive experiments on newly developed test instances and using an industrial case study from the ship engine compartment. The results also indicate that the AFJSP-ER approach effectively optimizes productivity and promotes ergonomic relief, offering a practical solution for scheduling in ergonomically challenging marine environments.

RLC Conference 2025 Conference Paper

Foundation Model Self-Play: Open-Ended Strategy Innovation via Foundation Models

  • Aaron Dharna
  • Cong Lu
  • Jeff Clune

Multi-agent interactions have long fueled innovation, from natural predator-prey dynamics to the space race. Self-play (SP) algorithms try to harness these dynamics by pitting agents against ever-improving opponents, thereby creating an implicit curriculum. However, SP often fails to produce diverse solutions and can get stuck in locally optimal behaviors. We introduce Foundation-Model Self-Play (FMSP), a new direction that leverages the code-generation capabilities and vast knowledge of foundation models (FMs) to overcome these challenges by leaping across optima in policy space. We propose a family of approaches: (1) Vanilla FMSP (vFMSP) continually refines and improves an agent’s policy via competitive self-play; (2) Novelty-Search Self-Play (NSSP) builds a diverse population of strategies, ignoring performance; and (3) the most promising variant, Quality-Diversity Self-Play (QDSP), creates a diverse set of high-quality policies by combining elements of both NSSP and vFMSP. We evaluate FMSPs in a continuous-control pursuer-evader setting (Car Tag) and in “Gandalf,” a simple AI safety simulation in which an attacker tries to jailbreak an LLM’s defenses. In Car Tag, our algorithms explore a wide variety of reinforcement learning, tree search, and heuristic-based methods, to name just a few. In terms of discovered policy quality, QDSP and vFMSP find policies that surpass strong human-designed strategies. In Gandalf, our algorithms can successfully automatically red-team an LLM, breaking through and jailbreaking six different, progressively stronger levels of defense. Furthermore, FMSPs enable us to automatically close the loop and rapidly patch the discovered vulnerabilities. Overall, FMSP and its many possible variants represent a promising new research frontier of improving self-play with foundation models, opening fresh paths toward more creative and open-ended strategy discovery.

RLJ Journal 2025 Journal Article

Foundation Model Self-Play: Open-Ended Strategy Innovation via Foundation Models

  • Aaron Dharna
  • Cong Lu
  • Jeff Clune

Multi-agent interactions have long fueled innovation, from natural predator-prey dynamics to the space race. Self-play (SP) algorithms try to harness these dynamics by pitting agents against ever-improving opponents, thereby creating an implicit curriculum. However, SP often fails to produce diverse solutions and can get stuck in locally optimal behaviors. We introduce Foundation-Model Self-Play (FMSP), a new direction that leverages the code-generation capabilities and vast knowledge of foundation models (FMs) to overcome these challenges by leaping across optima in policy space. We propose a family of approaches: (1) Vanilla FMSP (vFMSP) continually refines and improves an agent’s policy via competitive self-play; (2) Novelty-Search Self-Play (NSSP) builds a diverse population of strategies, ignoring performance; and (3) the most promising variant, Quality-Diversity Self-Play (QDSP), creates a diverse set of high-quality policies by combining elements of both NSSP and vFMSP. We evaluate FMSPs in a continuous-control pursuer-evader setting (Car Tag) and in “Gandalf,” a simple AI safety simulation in which an attacker tries to jailbreak an LLM’s defenses. In Car Tag, our algorithms explore a wide variety of reinforcement learning, tree search, and heuristic-based methods, to name just a few. In terms of discovered policy quality, QDSP and vFMSP find policies that surpass strong human-designed strategies. In Gandalf, our algorithms can successfully automatically red-team an LLM, breaking through and jailbreaking six different, progressively stronger levels of defense. Furthermore, FMSPs enable us to automatically close the loop and rapidly patch the discovered vulnerabilities. Overall, FMSP and its many possible variants represent a promising new research frontier of improving self-play with foundation models, opening fresh paths toward more creative and open-ended strategy discovery.

IROS Conference 2025 Conference Paper

IGDrivSim: A Benchmark for the Imitation Gap in Autonomous Driving

  • Clémence Grislain
  • Risto Vuorio
  • Cong Lu
  • Shimon Whiteson

Developing autonomous vehicles that can navigate complex environments with human-level safety and efficiency is a central goal in self-driving research. A common approach to achieving this is imitation learning, where agents are trained to mimic human expert demonstrations collected from real-world driving scenarios. However, discrepancies between human perception and the self-driving car's sensors can introduce an imitation gap, leading to imitation learning failures. In this work, we introduce IGDrivSim, a benchmark built on top of the Waymax simulator, designed to investigate the effects of the imitation gap in learning autonomous driving policy from human expert demonstrations. Our experiments show that this perception gap between human experts and self-driving agents can hinder the learning of safe and effective driving behaviors. We further show that combining imitation with reinforcement learning, using a simple penalty reward for prohibited behaviors, effectively mitigates these failures. All code developed for this work is released as open source.

ICLR Conference 2025 Conference Paper

Intelligent Go-Explore: Standing on the Shoulders of Giant Foundation Models

  • Cong Lu
  • Shengran Hu
  • Jeff Clune

Go-Explore is a powerful family of algorithms designed to solve hard-exploration problems built on the principle of archiving discovered states, and iteratively returning to and exploring from the most promising states. This approach has led to superhuman performance across a wide variety of challenging problems including Atari games and robotic control, but requires manually designing heuristics to guide exploration (i.e., determine which states to save and explore from, and what actions to consider next), which is time-consuming and infeasible in general. To resolve this, we propose Intelligent Go-Explore (IGE) which greatly extends the scope of the original Go-Explore by replacing these handcrafted heuristics with the intelligence and internalized human notions of interestingness captured by giant pretrained foundation models (FMs). This provides IGE with a human-like ability to instinctively identify how interesting or promising any new state is (e.g., discovering new objects, locations, or behaviors), even in complex environments where heuristics are hard to define. Moreover, IGE offers the exciting opportunity to recognize and capitalize on serendipitous discoveries, i.e., states encountered during exploration that are valuable in terms of exploration, yet where what makes them interesting was not anticipated by the human user. We evaluate our algorithm on a diverse range of language and vision-based tasks that require search and exploration. Across these tasks, IGE strongly exceeds classic reinforcement learning and graph search baselines, and also succeeds where prior state-of-the-art FM agents like Reflexion completely fail. Overall, Intelligent Go-Explore combines the tremendous strengths of FMs and the powerful Go-Explore algorithm, opening up a new frontier of research into creating more generally capable agents with impressive exploration capabilities. All our code is open-sourced at: https://github.com/conglu1997/intelligent-go-explore.

RLJ Journal 2024 Journal Article

Policy-Guided Diffusion

  • Matthew Thomas Jackson
  • Michael Matthews
  • Cong Lu
  • Benjamin Ellis
  • Shimon Whiteson
  • Jakob Nicolaus Foerster

In many real-world settings, agents must learn from an offline dataset gathered by some prior behavior policy. Such a setting naturally leads to distribution shift between the behavior policy and the target policy being trained—requiring policy conservatism to avoid instability and overestimation bias. Autoregressive world models offer a different solution to this by generating synthetic, on-policy experience. However, in practice, model rollouts must be severely truncated to avoid compounding error. As an alternative, we propose policy-guided diffusion. Our method uses diffusion models to generate entire trajectories under the behavior distribution, applying guidance from the target policy to move synthetic experience further on-policy. We show that policy-guided diffusion represents a regularized form of the target distribution that balances action likelihood under both the target and behavior policies, leading to plausible trajectories with high target policy probability, while retaining a lower dynamics error than an offline world model baseline. Using synthetic experience from policy-guided diffusion as a drop-in substitute for real data, we demonstrate significant improvements in performance across a range of standard offline reinforcement learning algorithms and environments. Our approach provides an effective alternative to autoregressive offline world models, opening the door to the controllable generation of synthetic training data.

RLC Conference 2024 Conference Paper

Policy-Guided Diffusion

  • Matthew Thomas Jackson
  • Michael Matthews
  • Cong Lu
  • Benjamin Ellis
  • Shimon Whiteson
  • Jakob Nicolaus Foerster

In many real-world settings, agents must learn from an offline dataset gathered by some prior behavior policy. Such a setting naturally leads to distribution shift between the behavior policy and the target policy being trained—requiring policy conservatism to avoid instability and overestimation bias. Autoregressive world models offer a different solution to this by generating synthetic, on-policy experience. However, in practice, model rollouts must be severely truncated to avoid compounding error. As an alternative, we propose policy-guided diffusion. Our method uses diffusion models to generate entire trajectories under the behavior distribution, applying guidance from the target policy to move synthetic experience further on-policy. We show that policy-guided diffusion represents a regularized form of the target distribution that balances action likelihood under both the target and behavior policies, leading to plausible trajectories with high target policy probability, while retaining a lower dynamics error than an offline world model baseline. Using synthetic experience from policy-guided diffusion as a drop-in substitute for real data, we demonstrate significant improvements in performance across a range of standard offline reinforcement learning algorithms and environments. Our approach provides an effective alternative to autoregressive offline world models, opening the door to the controllable generation of synthetic training data.

NeurIPS Conference 2024 Conference Paper

Pre-trained Text-to-Image Diffusion Models Are Versatile Representation Learners for Control

  • Gunshi Gupta
  • Karmesh Yadav
  • Yarin Gal
  • Zsolt Kira
  • Dhruv Batra
  • Cong Lu
  • Tim G. Rudner

Embodied AI agents require a fine-grained understanding of the physical world mediated through visual and language inputs. Such capabilities are difficult to learn solely from task-specific data. This has led to the emergence of pre-trained vision-language models as a tool for transferring representations learned from internet-scale data to downstream tasks and new domains. However, commonly used contrastively trained representations such as in CLIP have been shown to fail at enabling embodied agents to gain a sufficiently fine-grained scene understanding—a capability vital for control. To address this shortcoming, we consider representations from pre-trained text-to-image diffusion models, which are explicitly optimized to generate images from text prompts and as such, contain text-conditioned representations that reflect highly fine-grained visuo-spatial information. Using pre-trained text-to-image diffusion models, we construct Stable Control Representations which allow learning downstream control policies that generalize to complex, open-ended environments. We show that policies learned using Stable Control Representations are competitive with state-of-the-art representation learning approaches across a broad range of simulated control settings, encompassing challenging manipulation and navigation tasks. Most notably, we show that Stable Control Representations enable learning policies that exhibit state-of-the-art performance on OVMM, a difficult open-vocabulary navigation benchmark.

NeurIPS Conference 2024 Conference Paper

The Edge-of-Reach Problem in Offline Model-Based Reinforcement Learning

  • Anya Sims
  • Cong Lu
  • Jakob N. Foerster
  • Yee W. Teh

Offline reinforcement learning (RL) aims to train agents from pre-collected datasets. However, this comes with the added challenge of estimating the value of behaviors not covered in the dataset. Model-based methods offer a potential solution by training an approximate dynamics model, which then allows collection of additional synthetic data via rollouts in this model. The prevailing theory treats this approach as online RL in an approximate dynamics model, and any remaining performance gap is therefore understood as being due to dynamics model errors. In this paper, we analyze this assumption and investigate how popular algorithms perform as the learned dynamics model is improved. In contrast to both intuition and theory, if the learned dynamics model is replaced by the true error-free dynamics, existing model-based methods completely fail. This reveals a key oversight: The theoretical foundations assume sampling of full horizon rollouts in the learned dynamics model; however, in practice, the number of model-rollout steps is aggressively reduced to prevent accumulating errors. We show that this truncation of rollouts results in a set of edge-of-reach states at which we are effectively "bootstrapping from the void." This triggers pathological value overestimation and complete performance collapse. We term this the edge-of-reach problem. Based on this new insight, we fill important gaps in existing theory, and reveal how prior model-based methods are primarily addressing the edge-of-reach problem, rather than model-inaccuracy as claimed. Finally, we propose Reach-Aware Value Learning (RAVL), a simple and robust method that directly addresses the edge-of-reach problem and hence - unlike existing methods - does not fail as the dynamics model is improved. Since world models will inevitably improve, we believe this is a key step towards future-proofing offline RL.

TMLR Journal 2024 Journal Article

Video Diffusion Models: A Survey

  • Andrew Melnik
  • Michal Ljubljanac
  • Cong Lu
  • Qi Yan
  • Weiming Ren
  • Helge Ritter

Diffusion generative models have recently become a powerful technique for creating and modifying high-quality, coherent video content. This survey provides a comprehensive overview of the critical components of diffusion models for video generation, including their applications, architectural design, and temporal dynamics modeling. The paper begins by discussing the core principles and mathematical formulations, then explores various architectural choices and methods for maintaining temporal consistency. A taxonomy of applications is presented, categorizing models based on input modalities such as text prompts, images, videos, and audio signals. Advancements in text-to-video generation are discussed to illustrate the state-of-the-art capabilities and limitations of current approaches. Additionally, the survey summarizes recent developments in training and evaluation practices, including the use of diverse video and image datasets and the adoption of various evaluation metrics to assess model performance. The survey concludes with an examination of ongoing challenges, such as generating longer videos and managing computational costs, and offers insights into potential future directions for the field. By consolidating the latest research and developments, this survey aims to serve as a valuable resource for researchers and practitioners working with video diffusion models. Website: https://github.com/ndrwmlnk/Awesome-Video-Diffusion-Models

TMLR Journal 2023 Journal Article

Challenges and Opportunities in Offline Reinforcement Learning from Visual Observations

  • Cong Lu
  • Philip J. Ball
  • Tim G. J. Rudner
  • Jack Parker-Holder
  • Michael A Osborne
  • Yee Whye Teh

Offline reinforcement learning has shown great promise in leveraging large pre-collected datasets for policy learning, allowing agents to forgo often-expensive online data collection. However, offline reinforcement learning from visual observations with continuous action spaces remains under-explored, with a limited understanding of the key challenges in this complex domain. In this paper, we establish simple baselines for continuous control in the visual domain and introduce a suite of benchmarking tasks for offline reinforcement learning from visual observations designed to better represent the data distributions present in real-world offline RL problems and guided by a set of desiderata for offline RL from visual observations, including robustness to visual distractions and visually identifiable changes in dynamics. Using this suite of benchmarking tasks, we show that simple modifications to two popular vision-based online reinforcement learning algorithms, DreamerV2 and DrQ-v2, suffice to outperform existing offline RL methods and establish competitive baselines for continuous control in the visual domain. We rigorously evaluate these algorithms and perform an empirical evaluation of the differences between state-of-the-art model-based and model-free offline RL methods for continuous control from visual observations. All code and data used in this evaluation are open-sourced to facilitate progress in this domain.

NeurIPS Conference 2023 Conference Paper

Synthetic Experience Replay

  • Cong Lu
  • Philip Ball
  • Yee Whye Teh
  • Jack Parker-Holder

A key theme in the past decade has been that when large neural networks and large datasets combine they can produce remarkable results. In deep reinforcement learning (RL), this paradigm is commonly made possible through experience replay, whereby a dataset of past experiences is used to train a policy or value function. However, unlike in supervised or self-supervised learning, an RL agent has to collect its own data, which is often limited. Thus, it is challenging to reap the benefits of deep learning, and even small neural networks can overfit at the start of training. In this work, we leverage the tremendous recent progress in generative modeling and propose Synthetic Experience Replay (SynthER), a diffusion-based approach to flexibly upsample an agent's collected experience. We show that SynthER is an effective method for training RL agents across offline and online settings, in both proprioceptive and pixel-based environments. In offline settings, we observe drastic improvements when upsampling small offline datasets and see that additional synthetic data also allows us to effectively train larger networks. Furthermore, SynthER enables online agents to train with a much higher update-to-data ratio than before, leading to a significant increase in sample efficiency, without any algorithmic changes. We believe that synthetic training data could open the door to realizing the full potential of deep learning for replay-based RL algorithms from limited data. Finally, we open-source our code at https://github.com/conglu1997/SynthER.

ICLR Conference 2022 Conference Paper

Revisiting Design Choices in Offline Model Based Reinforcement Learning

  • Cong Lu
  • Philip J. Ball
  • Jack Parker-Holder
  • Michael A. Osborne
  • Stephen J. Roberts

Offline reinforcement learning enables agents to leverage large pre-collected datasets of environment transitions to learn control policies, circumventing the need for potentially expensive or unsafe online data collection. Significant progress has been made recently in offline model-based reinforcement learning, approaches which leverage a learned dynamics model. This typically involves constructing a probabilistic model, and using the model uncertainty to penalize rewards where there is insufficient data, solving for a pessimistic MDP that lower bounds the true MDP. Existing methods, however, exhibit a breakdown between theory and practice, whereby pessimistic return ought to be bounded by the total variation distance of the model from the true dynamics, but is instead implemented through a penalty based on estimated model uncertainty. This has spawned a variety of uncertainty heuristics, with little to no comparison between differing approaches. In this paper, we compare these heuristics, and design novel protocols to investigate their interaction with other hyperparameters, such as the number of models, or imaginary rollout horizon. Using these insights, we show that selecting these key hyperparameters using Bayesian Optimization produces superior configurations that are vastly different to those currently used in existing hand-tuned state-of-the-art methods, and result in drastically stronger performance.

ICML Conference 2021 Conference Paper

Augmented World Models Facilitate Zero-Shot Dynamics Generalization From a Single Offline Environment

  • Philip J. Ball
  • Cong Lu
  • Jack Parker-Holder
  • Stephen J. Roberts

Reinforcement learning from large-scale offline datasets provides us with the ability to learn policies without potentially unsafe or impractical exploration. Significant progress has been made in the past few years in dealing with the challenge of correcting for differing behavior between the data collection and learned policies. However, little attention has been paid to potentially changing dynamics when transferring a policy to the online setting, where performance can be reduced by up to 90% for existing methods. In this paper we address this problem with Augmented World Models (AugWM). We augment a learned dynamics model with simple transformations that seek to capture potential changes in physical properties of the robot, leading to more robust policies. We not only train our policy in this new setting, but also provide it with the sampled augmentation as a context, allowing it to adapt to changes in the environment. At test time we learn the context in a self-supervised fashion by approximating the augmentation which corresponds to the new environment. We rigorously evaluate our approach on over 100 different changed dynamics settings, and show that this simple approach can significantly improve the zero-shot generalization of a recent state-of-the-art baseline, often achieving successful policies where the baseline fails.

ICML Conference 2021 Conference Paper

Exploration in Approximate Hyper-State Space for Meta Reinforcement Learning

  • Luisa M. Zintgraf
  • Leo Feng
  • Cong Lu
  • Maximilian Igl
  • Kristian Hartikainen
  • Katja Hofmann
  • Shimon Whiteson

To rapidly learn a new task, it is often essential for agents to explore efficiently - especially when performance matters from the first timestep. One way to learn such behaviour is via meta-learning. Many existing methods however rely on dense rewards for meta-training, and can fail catastrophically if the rewards are sparse. Without a suitable reward signal, the need for exploration during meta-training is exacerbated. To address this, we propose HyperX, which uses novel reward bonuses for meta-training to explore in approximate hyper-state space (where hyper-states represent the environment state and the agent’s task belief). We show empirically that HyperX meta-learns better task-exploration and adapts more successfully to new tasks than existing methods.

NeurIPS Conference 2021 Conference Paper

On Pathologies in KL-Regularized Reinforcement Learning from Expert Demonstrations

  • Tim G. J. Rudner
  • Cong Lu
  • Michael A Osborne
  • Yarin Gal
  • Yee Teh

KL-regularized reinforcement learning from expert demonstrations has proved successful in improving the sample efficiency of deep reinforcement learning algorithms, allowing them to be applied to challenging physical real-world tasks. However, we show that KL-regularized reinforcement learning with behavioral reference policies derived from expert demonstrations can suffer from pathological training dynamics that can lead to slow, unstable, and suboptimal online learning. We show empirically that the pathology occurs for commonly chosen behavioral policy classes and demonstrate its impact on sample efficiency and online policy performance. Finally, we show that the pathology can be remedied by non-parametric behavioral reference policies and that this allows KL-regularized reinforcement learning to significantly outperform state-of-the-art approaches on a variety of challenging locomotion and dexterous hand manipulation tasks.

ICML Conference 2021 Conference Paper

Think Global and Act Local: Bayesian Optimisation over High-Dimensional Categorical and Mixed Search Spaces

  • Xingchen Wan
  • Vu Nguyen
  • Huong Ha
  • Bin Xin Ru
  • Cong Lu
  • Michael A. Osborne

High-dimensional black-box optimisation remains an important yet notoriously challenging problem. Despite the success of Bayesian optimisation methods on continuous domains, domains that are categorical, or that mix continuous and categorical variables, remain challenging. We propose a novel solution—we combine local optimisation with a tailored kernel design, effectively handling high-dimensional categorical and mixed search spaces, whilst retaining sample efficiency. We further derive a convergence guarantee for the proposed approach. Finally, we demonstrate empirically that our method outperforms the current baselines on a variety of synthetic and real-world tasks in terms of performance, computational costs, or both.

JMLR Journal 2021 Journal Article

VariBAD: Variational Bayes-Adaptive Deep RL via Meta-Learning

  • Luisa Zintgraf
  • Sebastian Schulze
  • Cong Lu
  • Leo Feng
  • Maximilian Igl
  • Kyriacos Shiarlis
  • Yarin Gal
  • Katja Hofmann

Trading off exploration and exploitation in an unknown environment is key to maximising expected online return during learning. A Bayes-optimal policy, which does so optimally, conditions its actions not only on the environment state but also on the agent's uncertainty about the environment. Computing a Bayes-optimal policy is however intractable for all but the smallest tasks. In this paper, we introduce variational Bayes-Adaptive Deep RL (variBAD), a way to meta-learn approximately Bayes-optimal policies for complex tasks. VariBAD simultaneously meta-learns a variational auto-encoder to perform approximate inference, and a policy that incorporates task uncertainty directly during action selection by conditioning on both the environment state and the approximate belief. In two toy domains, we illustrate how variBAD performs structured online exploration as a function of task uncertainty. We further evaluate variBAD on MuJoCo tasks widely used in meta-RL and show that it achieves higher online return than existing methods. On the recently proposed Meta-World ML1 benchmark, variBAD achieves state of the art results by a large margin, fully solving two out of the three ML1 tasks for the first time.