Author name cluster

Marc Rigter

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

16 papers

2 author rows

RLC Conference 2025 Conference Paper

AVID: Adapting Video Diffusion Models to World Models

Marc Rigter
Tarun Gupta
Agrin Hilmkil
Chao Ma

Reinforcement learning (RL) is highly effective in domains that can be easily simulated. However, in problems such as robotic manipulation, accurate simulation is challenging and gathering large amounts of real-world data is impractical. A potential solution lies in leveraging widely-available unlabelled videos to train world models that simulate the consequences of actions. If the world model is accurate, it can be used to generate synthetic data to optimize decision-making via RL. Image-to-video diffusion models are already capable of generating highly realistic synthetic videos. However, these models are not action-conditioned, and the most powerful models are closed-source which means they cannot be finetuned. In this work, we propose to adapt pretrained video diffusion models to action-conditioned world models, without access to the parameters of the pretrained model. Our approach, AVID, trains an adapter on a small domain-specific dataset of action-labelled videos. AVID uses a learned mask to modify the intermediate outputs of the pretrained model and generate accurate action-conditioned videos. We evaluate AVID on video game and real-world robotics data, and show that it generally outperforms baselines for diffusion adaptation in video and image metrics. AVID demonstrates that pretrained video models have the potential to be powerful tools for generating synthetic data for RL agents. In future work, we wish to investigate how the improved data generation accuracy translates to model-based RL performance.


RLJ Journal 2025 Journal Article

AVID: Adapting Video Diffusion Models to World Models

Marc Rigter
Tarun Gupta
Agrin Hilmkil
Chao Ma


ICLR Conference 2024 Conference Paper

Reward-Free Curricula for Training Robust World Models

Marc Rigter
Minqi Jiang
Ingmar Posner

There has been a recent surge of interest in developing generally-capable agents that can adapt to new tasks without additional training in the environment. Learning world models from reward-free exploration is a promising approach, and enables policies to be trained using imagined experience for new tasks. However, achieving a general agent requires robustness across different environments. In this work, we address the novel problem of generating curricula in the reward-free setting to train robust world models. We consider robustness in terms of minimax regret over all environment instantiations and show that the minimax regret can be connected to minimising the maximum error in the world model across environment instances. This result informs our algorithm, WAKER: Weighted Acquisition of Knowledge across Environments for Robustness. WAKER selects environments for data collection based on the estimated error of the world model for each environment. Our experiments demonstrate that WAKER outperforms naïve domain randomisation, resulting in improved robustness, efficiency, and generalisation.


ICRA Conference 2024 Conference Paper

TWIST: Teacher-Student World Model Distillation for Efficient Sim-to-Real Transfer

Jun Yamada
Marc Rigter
Jack Collins
Ingmar Posner

Model-based RL is a promising approach for real-world robotics due to its improved sample efficiency and generalization capabilities compared to model-free RL. However, effective model-based RL solutions for vision-based real-world applications require bridging the sim-to-real gap for any world model learnt. Due to its significant computational cost, standard domain randomisation does not provide an effective solution to this problem. This paper proposes TWIST (Teacher-Student World Model Distillation for Sim-to-Real Transfer) to achieve efficient sim-to-real transfer of vision-based model-based RL using distillation. Specifically, TWIST leverages state observations as readily accessible, privileged information commonly garnered from a simulator to significantly accelerate sim-to-real transfer. Specifically, a teacher world model is trained efficiently on state information. At the same time, a matching dataset is collected of domain-randomised image observations. The teacher world model then supervises a student world model that takes the domain-randomised image observations as input. By distilling the learned latent dynamics model from the teacher to the student model, TWIST achieves efficient and effective sim-to-real transfer for vision-based model-based RL tasks. Experiments in simulated and real robotics tasks demonstrate that our approach outperforms naive domain randomisation and model-free methods in terms of sample efficiency and task performance of sim-to-real transfer.


TMLR Journal 2024 Journal Article

World Models via Policy-Guided Trajectory Diffusion

Marc Rigter
Jun Yamada
Ingmar Posner

World models are a powerful tool for developing intelligent agents. By predicting the outcome of a sequence of actions, world models enable policies to be optimised via on-policy reinforcement learning (RL) using synthetic data, i.e. “in imagination”. Existing world models are autoregressive in that they interleave predicting the next state with sampling the next action from the policy. Prediction error inevitably compounds as the trajectory length grows. In this work, we propose a novel world modelling approach that is not autoregressive and generates entire on-policy trajectories in a single pass through a diffusion model. Our approach, Policy-Guided Trajectory Diffusion (PolyGRAD), leverages a denoising model in addition to the gradient of the action distribution of the policy to diffuse a trajectory of initially random states and actions into an on-policy synthetic trajectory. We analyse the connections between PolyGRAD, score-based generative models, and classifier-guided diffusion models. Our results demonstrate that PolyGRAD outperforms state-of-the-art baselines in terms of trajectory prediction error for short trajectories, with the exception of autoregressive diffusion. For short trajectories, PolyGRAD obtains similar errors to autoregressive diffusion, but with lower computational requirements. For long trajectories, PolyGRAD obtains comparable performance to baselines. Our experiments demonstrate that PolyGRAD enables performant policies to be trained via on-policy RL in imagination for MuJoCo continuous control domains. Thus, PolyGRAD introduces a new paradigm for accurate on-policy world modelling without autoregressive sampling.


NeurIPS Conference 2023 Conference Paper

One Risk to Rule Them All: A Risk-Sensitive Perspective on Model-Based Offline Reinforcement Learning

Marc Rigter
Bruno Lacerda
Nick Hawes

Offline reinforcement learning (RL) is suitable for safety-critical domains where online exploration is not feasible. In such domains, decision-making should take into consideration the risk of catastrophic outcomes. In other words, decision-making should be risk-averse. An additional challenge of offline RL is avoiding distributional shift, i.e. ensuring that state-action pairs visited by the policy remain near those in the dataset. Previous offline RL algorithms that consider risk combine offline RL techniques (to avoid distributional shift), with risk-sensitive RL algorithms (to achieve risk-aversion). In this work, we propose risk-aversion as a mechanism to jointly address both of these issues. We propose a model-based approach, and use an ensemble of models to estimate epistemic uncertainty, in addition to aleatoric uncertainty. We train a policy that is risk-averse, and avoids high uncertainty actions. Risk-aversion to epistemic uncertainty prevents distributional shift, as areas not covered by the dataset have high epistemic uncertainty. Risk-aversion to aleatoric uncertainty discourages actions that are risky due to environment stochasticity. Thus, by considering epistemic uncertainty via a model ensemble and introducing risk-aversion, our algorithm (1R2R) avoids distributional shift in addition to achieving risk-aversion to aleatoric risk. Our experiments show that 1R2R achieves strong performance on deterministic benchmarks, and outperforms existing approaches for risk-sensitive objectives in stochastic domains.
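
To make the distinction between the two kinds of uncertainty concrete, here is a minimal sketch of one common way to split the predictive variance of an ensemble into aleatoric and epistemic parts. It is not taken from the paper and is not how 1R2R itself uses the ensemble; the function name `decompose_uncertainty` and the assumption that each ensemble member outputs a scalar Gaussian mean and variance are illustrative only.

```python
from statistics import fmean, pvariance


def decompose_uncertainty(means, variances):
    """Split ensemble predictive variance into (aleatoric, epistemic).

    Assumption for illustration: each ensemble member predicts a scalar
    Gaussian with its own mean and variance for the same input.
    - aleatoric: average of the members' predicted variances
    - epistemic: variance of the members' predicted means (disagreement)
    """
    if not means or len(means) != len(variances):
        raise ValueError("means and variances must be non-empty and equal length")
    aleatoric = fmean(variances)
    epistemic = pvariance(means)
    return aleatoric, epistemic


# Members that disagree on the mean signal high epistemic uncertainty,
# as would be expected for inputs far from the dataset.
disagreeing = decompose_uncertainty([1.0, 3.0], [0.5, 1.5])
# Members that agree signal low epistemic uncertainty.
agreeing = decompose_uncertainty([2.0, 2.0], [0.5, 0.5])
```

Under this split, disagreement between ensemble members is what indicates poor dataset coverage, while the averaged member variance reflects environment stochasticity.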


AAAI Conference 2023 Conference Paper

Planning with Hidden Parameter Polynomial MDPs

Clarissa Costen
Marc Rigter
Bruno Lacerda
Nick Hawes

For many applications of Markov Decision Processes (MDPs), the transition function cannot be specified exactly. Bayes-Adaptive MDPs (BAMDPs) extend MDPs to consider transition probabilities governed by latent parameters. To act optimally in BAMDPs, one must maintain a belief distribution over the latent parameters. Typically, this distribution is described by a set of sample (particle) MDPs, and associated weights which represent the likelihood of a sample MDP being the true underlying MDP. However, as the number of dimensions of the latent parameter space increases, the number of sample MDPs required to sufficiently represent the belief distribution grows exponentially. Thus, maintaining an accurate belief in the form of a set of sample MDPs over complex latent spaces is computationally intensive, which in turn affects the performance of planning for these models. In this paper, we propose an alternative approach for maintaining the belief over the latent parameters. We consider a class of BAMDPs where the transition probabilities can be expressed in closed form as a polynomial of the latent parameters, and outline a method to maintain a closed-form belief distribution for the latent parameters which results in an accurate belief representation. Furthermore, the closed-form representation does away with the need to tune the number of sample MDPs required to represent the belief. We evaluate two domains and empirically show that the polynomial, closed-form, belief representation results in better plans than a sampling-based belief representation.
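
The sampling-based belief that the abstract contrasts against can be sketched as a weighted set of candidate MDPs whose weights are updated by Bayes' rule after each observed transition. The snippet below is a minimal illustration of that baseline only, not of the paper's closed-form polynomial method; the function name `update_particle_weights` and the example numbers are assumptions.

```python
def update_particle_weights(weights, likelihoods):
    """Bayes update of particle weights after observing one transition.

    weights[i]     : current belief weight of sample MDP i
    likelihoods[i] : P_i(s' | s, a), the probability sample MDP i assigns
                     to the transition that was actually observed
    Returns the renormalised posterior weights.
    """
    if len(weights) != len(likelihoods):
        raise ValueError("weights and likelihoods must have equal length")
    unnormalised = [w * p for w, p in zip(weights, likelihoods)]
    total = sum(unnormalised)
    if total == 0.0:
        raise ValueError("observed transition has zero likelihood under every particle")
    return [u / total for u in unnormalised]


# Two sample MDPs, initially equally likely. The observed transition is
# much more probable under the first one, so the belief shifts towards it.
posterior = update_particle_weights([0.5, 0.5], [0.9, 0.1])
```

Every particle needs its own weight and likelihood evaluation, which is why the cost of this representation grows with the number of sample MDPs needed to cover the latent space.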


AAMAS Conference 2023 Conference Paper

Risk-Constrained Planning for Multi-Agent Systems with Shared Resources

Anna Gautier
Marc Rigter
Bruno Lacerda
Nick Hawes
Michael Wooldridge

Planning under uncertainty requires complex reasoning about future events, and this complexity increases with the addition of multiple agents. One problem faced when considering multi-agent systems under uncertainty is the handling of shared resources. Adding a resource constraint limits the actions that agents can take, forcing collaborative decision making on who gets to use what resources. Prior work has considered different formulations, such as satisfying a resource constraint in expectation or ensuring that a resource constraint is met some percent of the time. However, these formulations of constrained planning ignore important distributional information about resource usage. Namely, they do not consider how bad the worst cases can get. In this paper, we formulate a risk-constrained shared resource problem and aim to limit the risk of excessive use of such resources. We focus on optimising for reward while constraining the Conditional Value-at-Risk (CVaR) of the shared resource. While CVaR is well studied in the single-agent setting, we consider the challenges that arise from the state and action space explosion in the multi-agent setting. In particular, we exploit risk contributions, a measure introduced in finance research which quantifies how much individual agents affect the joint risk. We present an algorithm that uses risk contributions to iteratively update single-agent policies until the joint risk constraint is satisfied. We evaluate our algorithm on two synthetic domains.


AAAI Conference 2022 Conference Paper

Optimal Admission Control for Multiclass Queues with Time-Varying Arrival Rates via State Abstraction

Marc Rigter
Danial Dervovic
Parisa Hassanzadeh
Jason Long
Parisa Zehtabi
Daniele Magazzeni

We consider a novel queuing problem where the decision-maker must choose to accept or reject randomly arriving tasks into a no-buffer queue which are processed by N identical servers. Each task has a price, which is a positive real number, and a class. Each class of task has a different price distribution, service rate, and arrives according to an inhomogeneous Poisson process. The objective is to decide which tasks to accept so that the total price of tasks processed is maximised over a finite horizon. We formulate the problem using a discrete time Markov Decision Process (MDP) with a hybrid state space. We show that the optimal value function has a specific structure, which enables us to solve the hybrid MDP exactly. Moreover, we rigorously prove that as the gap between successive decision epochs grows smaller, the discrete time solution approaches the optimal solution to the original continuous time problem. To improve the scalability of our approach to a greater number of servers and task classes, we present an approximation based on state abstraction. We validate our approach on synthetic data, as well as a real financial fraud data set, which is the motivating application for this work.


ICAPS Conference 2022 Conference Paper

Planning for Risk-Aversion and Expected Value in MDPs

Marc Rigter
Paul Duckworth
Bruno Lacerda
Nick Hawes

Planning in Markov decision processes (MDPs) typically optimises the expected cost. However, optimising the expectation does not consider the risk that for any given run of the MDP, the total cost received may be unacceptably high. An alternative approach is to find a policy which optimises a risk-averse objective such as conditional value at risk (CVaR). However, optimising the CVaR alone may result in poor performance in expectation. In this work, we begin by showing that there can be multiple policies which obtain the optimal CVaR. This motivates us to propose a lexicographic approach which minimises the expected cost subject to the constraint that the CVaR of the total cost is optimal. We present an algorithm for this problem and evaluate our approach on four domains. Our results demonstrate that our lexicographic approach improves the expected cost compared to the state-of-the-art algorithm, while achieving the optimal CVaR.
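
The lexicographic objective can be illustrated on sampled total costs. The sketch below is not the paper's planning algorithm: it assumes a small, hypothetical set of candidate policies, each summarised by equally likely sampled total costs, and uses an empirical CVaR (the mean of the worst alpha-fraction of costs). The names `cvar`, `lexicographic_choice`, and the policy labels are illustrative.

```python
import math


def cvar(costs, alpha):
    """Empirical CVaR at level alpha: mean of the worst alpha-fraction of costs.

    Higher cost is worse, so the tail is taken from the largest values.
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    ordered = sorted(costs, reverse=True)
    k = max(1, math.ceil(alpha * len(ordered)))
    return sum(ordered[:k]) / k


def expected_cost(costs):
    return sum(costs) / len(costs)


def lexicographic_choice(policy_costs, alpha, tol=1e-9):
    """Among policies with the best CVaR, return the one with lowest expected cost."""
    cvars = {name: cvar(c, alpha) for name, c in policy_costs.items()}
    best = min(cvars.values())
    tied = [name for name, value in cvars.items() if value <= best + tol]
    return min(tied, key=lambda name: expected_cost(policy_costs[name]))


# Hypothetical sampled total costs for three candidate policies.
policy_costs = {
    "A": [1, 1, 1, 1, 1, 1, 10, 10],
    "B": [4, 4, 4, 4, 4, 4, 10, 10],
    "C": [0, 0, 0, 0, 0, 0, 0, 22],
}
chosen = lexicographic_choice(policy_costs, alpha=0.25)
```

In this toy data, A and B share the best CVaR at alpha = 0.25 while C has the lowest expected cost but a worse tail; the lexicographic rule breaks the tie between A and B using expected cost.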


NeurIPS Conference 2022 Conference Paper

RAMBO-RL: Robust Adversarial Model-Based Offline Reinforcement Learning

Marc Rigter
Bruno Lacerda
Nick Hawes

Offline reinforcement learning (RL) aims to find performant policies from logged data without further environment interaction. Model-based algorithms, which learn a model of the environment from the dataset and perform conservative policy optimisation within that model, have emerged as a promising approach to this problem. In this work, we present Robust Adversarial Model-Based Offline RL (RAMBO), a novel approach to model-based offline RL. We formulate the problem as a two-player zero sum game against an adversarial environment model. The model is trained to minimise the value function while still accurately predicting the transitions in the dataset, forcing the policy to act conservatively in areas not covered by the dataset. To approximately solve the two-player game, we alternate between optimising the policy and adversarially optimising the model. The problem formulation that we address is theoretically grounded, resulting in a probably approximately correct (PAC) performance guarantee and a pessimistic value function which lower bounds the value function in the true environment. We evaluate our approach on widely studied offline RL benchmarks, and demonstrate that it outperforms existing state-of-the-art baselines.


IJCAI Conference 2022 Conference Paper

Shared Autonomy Systems with Stochastic Operator Models

Clarissa Costen
Marc Rigter
Bruno Lacerda
Nick Hawes

We consider shared autonomy systems where multiple operators (AI and human) can interact with the environment, e.g. by controlling a robot. The decision problem for the shared autonomy system is to select which operator takes control at each timestep, such that a reward specifying the intended system behaviour is maximised. The performance of the human operator is influenced by unobserved factors, such as fatigue or skill level. Therefore, the system must reason over stochastic models of operator performance. We present a framework for stochastic operators in shared autonomy systems (SO-SAS), where we represent operators using rich, partially observable models. We formalise SO-SAS as a mixed-observability Markov decision process, where environment states are fully observable and internal operator states are hidden. We test SO-SAS on a simulated domain and a computer game, empirically showing it results in better performance compared to traditional formulations of shared autonomy systems.


AAAI Conference 2021 Conference Paper

Minimax Regret Optimisation for Robust Planning in Uncertain Markov Decision Processes

Marc Rigter
Bruno Lacerda
Nick Hawes

The parameters for a Markov Decision Process (MDP) often cannot be specified exactly. Uncertain MDPs (UMDPs) capture this model ambiguity by defining sets which the parameters belong to. Minimax regret has been proposed as an objective for planning in UMDPs to find robust policies which are not overly conservative. In this work, we focus on planning for Stochastic Shortest Path (SSP) UMDPs with uncertain cost and transition functions. We introduce a Bellman equation to compute the regret for a policy. We propose a dynamic programming algorithm that utilises the regret Bellman equation, and show that it optimises minimax regret exactly for UMDPs with independent uncertainties. For coupled uncertainties, we extend our approach to use options to enable a trade off between computation and solution quality. We evaluate our approach on both synthetic and real-world domains, showing that it significantly outperforms existing baselines.
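
The difference between minimax regret and a plain worst-case objective can be shown on a toy cost table. This sketch is not the paper's dynamic programming algorithm: it assumes a finite set of hypothetical candidate policies evaluated on a finite set of sampled models, and it measures regret against the best candidate for each model rather than against the true optimal policy. The names `max_regret` and the policy labels are illustrative.

```python
def max_regret(costs):
    """Return each policy's maximum regret over the sampled models.

    costs[name][m] is the expected cost of policy `name` in model m.
    Regret in model m is the gap to the cheapest candidate in that model.
    """
    n_models = len(next(iter(costs.values())))
    best = [min(c[m] for c in costs.values()) for m in range(n_models)]
    return {
        name: max(c[m] - best[m] for m in range(n_models))
        for name, c in costs.items()
    }


# Hypothetical expected costs of three policies in two sampled models.
costs = {
    "cautious": [5, 5],
    "bold": [2, 9],
    "mixed": [3, 6],
}
regrets = max_regret(costs)
minimax_regret_policy = min(regrets, key=regrets.get)
worst_case_policy = min(costs, key=lambda name: max(costs[name]))
```

Here the worst-case objective selects the cautious policy, while minimax regret selects the mixed policy, which is never far from the best candidate in either model. This mirrors the abstract's point that minimax regret yields robust policies that are not overly conservative.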


NeurIPS Conference 2021 Conference Paper

Risk-Averse Bayes-Adaptive Reinforcement Learning

Marc Rigter
Bruno Lacerda
Nick Hawes

In this work, we address risk-averse Bayes-adaptive reinforcement learning. We pose the problem of optimising the conditional value at risk (CVaR) of the total return in Bayes-adaptive Markov decision processes (MDPs). We show that a policy optimising CVaR in this setting is risk-averse to both the epistemic uncertainty due to the prior distribution over MDPs, and the aleatoric uncertainty due to the inherent stochasticity of MDPs. We reformulate the problem as a two-player stochastic game and propose an approximate algorithm based on Monte Carlo tree search and Bayesian optimisation. Our experiments demonstrate that our approach significantly outperforms baseline approaches for this problem.


IROS Conference 2019 Conference Paper

An Autonomous Quadrotor System for Robust High-Speed Flight Through Cluttered Environments Without GPS

Marc Rigter
Benjamin Morrell
Robert G. Reid
Gene B. Merewether
Theodore Tzanetos
Vinay Rajur
K. C. Wong
Larry H. Matthies

Robust autonomous flight without GPS is key to many emerging drone applications, such as delivery, search and rescue, and warehouse inspection. These and other applications require accurate trajectory tracking through cluttered static environments, where GPS can be unreliable, while high-speed, agile flight can increase efficiency. We describe the hardware and software of a quadrotor system that meets these requirements with onboard processing: a custom 300 mm wide quadrotor that uses two wide-field-of-view cameras for visual-inertial motion tracking and relocalization to a prior map. Collision-free trajectories are planned offline and tracked online with a custom tracking controller. This controller includes compensation for drag and variability in propeller performance, enabling accurate trajectory tracking, even at high speeds where aerodynamic effects are significant. We describe a system identification approach that identifies quadrotor-specific parameters via maximum likelihood estimation from flight data. Results from flight experiments are presented, which 1) validate the system identification method, 2) show that our controller with aerodynamic compensation reduces tracking error by more than 50% in both horizontal flights at up to 8.5 m/s and vertical flights at up to 3.1 m/s compared to the state-of-the-art, and 3) demonstrate our system tracking complex, aggressive trajectories.


ICRA Conference 2018 Conference Paper

Differential Flatness Transformations for Aggressive Quadrotor Flight

Benjamin Morrell
Marc Rigter
Gene B. Merewether
Robert G. Reid
Rohan Thakker
Theodore Tzanetos
Vinay Rajur
Gregory E. Chamitoff

Aggressive maneuvering amongst obstacles could enable advanced capabilities for quadrotors in applications such as search and rescue, surveillance, inspection, and situations where rapid flight is required in cluttered environments. Previous works have treated quadrotors as differentially flat systems, and this property has been exploited widely to design simple algorithms that generate dynamically feasible trajectories and to enable hierarchical control. The differentially flat property allows the full state of the quadrotor to be extracted from the reduced dimensional space of x, y, z, yaw and their derivatives. This differential flatness transformation has a number of singularities, however, as well as stability issues when controlling near these singularities. Many methods have been described in the literature to address these; however, they all have limitations when exploring the full flight envelope of a quadrotor, including roll or pitch angles past 90°, and during inverted flight. In this paper, we review these existing methods and then introduce our method, which combines multiple methods to provide a highly-robust differential flatness transformation that addresses most of these issues. Our approach is demonstrated enabling highly-aggressive quadrotor flight in both simulations and real-world experiments.
