Author name cluster

Markus Wulfmeier

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

17 papers

2 author rows

ICML Conference 2025 Conference Paper

EvoControl: Multi-Frequency Bi-Level Control for High-Frequency Continuous Control

Samuel Holt
Todor Davchev
Dhruva Tirumala
Ben Moran
Atil Iscen
Antoine Laurens
Yixin Lin
Erik Frey

High-frequency control in continuous action and state spaces is essential for practical applications in the physical world. Directly applying end-to-end reinforcement learning to high-frequency control tasks struggles with assigning credit to actions across long temporal horizons, compounded by the difficulty of efficient exploration. The alternative, learning low-frequency policies that guide higher-frequency controllers (e. g. , proportional-derivative (PD) controllers), can result in a limited total expressiveness of the combined control system, hindering overall performance. We introduce EvoControl, a novel bi-level policy learning framework for learning both a slow high-level policy (using PPO) and a fast low-level policy (using Evolution Strategies) for solving continuous control tasks. Learning with Evolution Strategies for the lower-policy allows robust learning for long horizons that crucially arise when operating at higher frequencies. This enables EvoControl to learn to control interactions at a high frequency, benefitting from more efficient exploration and credit assignment than direct high-frequency torque control without the need to hand-tune PD parameters. We empirically demonstrate that EvoControl can achieve a higher evaluation reward for continuous-control tasks compared to existing approaches, specifically excelling in tasks where high-frequency control is needed, such as those requiring safety-critical fast reactions.

Details

IROS Conference 2025 Conference Paper

Exploiting Policy Idling for Dexterous Manipulation

Annie S. Chen
Philemon Brakel
Antonia Bronars
Annie Xie
Sandy Han Huang
Oliver Groth
Maria Bauzá 0001
Markus Wulfmeier

Learning based methods for dexterous manipulation have made notable progress in recent years, and they can now produce solutions to complex tasks. However, learned policies often still lack reliability and exhibit limited robustness to important factors of variation. One failure pattern that can be observed across many settings is that policies idle, i. e. they cease to move beyond a small region of states, often indefinitely, when they reach certain states. This policy idling is often a reflection of the training data. For instance, it can occur when the data contains small actions in areas where the robot needs to perform high-precision motions, e. g. , when preparing to grasp an object or object insertion. Prior works have tried to mitigate this phenomenon e. g. by filtering the training data or modifying the control frequency. However, these approaches can negatively impact policy performance in other ways. As an alternative, we investigate how to leverage the detectability of idling behavior to inform exploration and policy improvement. Our approach, Pause-Induced Perturbations (PIP), applies perturbations at detected idling states, thus helping it to escape problematic basins of attraction. On a range of challenging simulated dual-arm tasks, we find that this simple approach can already noticeably improve test-time performance, with no additional supervision or training. Furthermore, since the robot tends to idle at critical points in a movement, we also find that learning from the resulting episodes leads to better iterative policy improvement compared to prior approaches. Our perturbation strategy also leads to a 15-35% improvement in absolute success rate on a real-world insertion task that requires complex multi-finger manipulation.

Details

NeurIPS Conference 2024 Conference Paper

Imitating Language via Scalable Inverse Reinforcement Learning

Markus Wulfmeier
Michael Bloesch
Nino Vieillard
Arun Ahuja
Jörg Bornschein
Sandy Huang
Artem Sokolov
Matt Barnes

The majority of language model training builds on imitation learning. It covers pretraining, supervised fine-tuning, and affects the starting conditions for reinforcement learning from human feedback (RLHF). The simplicity and scalability of maximum likelihood estimation (MLE) for next token prediction led to its role as predominant paradigm. However, the broader field of imitation learning can more effectively utilize the sequential structure underlying autoregressive generation. We focus on investigating the inverse reinforcement learning (IRL) perspective to imitation, extracting rewards and directly optimizing sequences instead of individual token likelihoods and evaluate its benefits for fine-tuning large language models. We provide a new angle, reformulating inverse soft-Q-learning as a temporal difference regularized extension of MLE. This creates a principled connection between MLE and IRL and allows trading off added complexity with increased performance and diversity of generations in the supervised fine-tuning (SFT) setting. We find clear advantages for IRL-based imitation, in particular for retaining diversity while maximizing task performance, rendering IRL a strong alternative on fixed SFT datasets even without online data generation. Our analysis of IRL-extracted reward functions further indicates benefits for more robust reward functions via tighter integration of supervised and preference-based LLM post-training.

PDF Details DOI

ICLR Conference 2024 Conference Paper

Massively Scalable Inverse Reinforcement Learning in Google Maps

Matt Barnes 0001
Matthew Abueg
Oliver F. Lange
Matt Deeds
Jason Trader
Denali Molitor
Markus Wulfmeier
Shawn O'Banion

Inverse reinforcement learning (IRL) offers a powerful and general framework for learning humans' latent preferences in route recommendation, yet no approach has successfully addressed planetary-scale problems with hundreds of millions of states and demonstration trajectories. In this paper, we introduce scaling techniques based on graph compression, spatial parallelization, and improved initialization conditions inspired by a connection to eigenvector algorithms. We revisit classic IRL methods in the routing context, and make the key observation that there exists a trade-off between the use of cheap, deterministic planners and expensive yet robust stochastic policies. This insight is leveraged in Receding Horizon Inverse Planning (RHIP), a new generalization of classic IRL algorithms that provides fine-grained control over performance trade-offs via its planning horizon. Our contributions culminate in a policy that achieves a 16-24% improvement in route quality at a global scale, and to the best of our knowledge, represents the largest published study of IRL algorithms in a real-world setting to date. We conclude by conducting an ablation study of key components, presenting negative results from alternative eigenvalue solvers, and identifying opportunities to further improve scalability via IRL-specific batching strategies.

Details

ICRA Conference 2024 Conference Paper

Mastering Stacking of Diverse Shapes with Large-Scale Iterative Reinforcement Learning on Real Robots

Thomas Lampe
Abbas Abdolmaleki
Sarah Bechtle
Sandy Han Huang
Jost Tobias Springenberg
Michael Bloesch
Oliver Groth
Roland Hafner

Reinforcement learning solely from an agent’s self-generated data is often believed to be infeasible for learning on real robots, due to the amount of data needed. However, if done right, agents learning from real data can be surprisingly efficient through re-using previously collected sub-optimal data. In this paper we demonstrate how the increased understanding of off-policy learning methods and their embedding in an iterative online/offline scheme ("collect and infer") can drastically improve data-efficiency by using all the collected experience, which empowers learning from real robot experience only. Moreover, the resulting policy improves significantly over the state of the art on a recently proposed real robot manipulation benchmark. Our approach learns end-to-end, directly from pixels, and does not rely on additional human domain knowledge such as a simulator or demonstrations.

Details

ICLR Conference 2024 Conference Paper

Replay across Experiments: A Natural Extension of Off-Policy RL

Dhruva Tirumala
Thomas Lampe
José Enrique Chen
Tuomas Haarnoja
Sandy Han Huang
Guy Lever
Ben Moran
Tim Hertweck

Replaying data is a principal mechanism underlying the stability and data efficiency of off-policy reinforcement learning (RL). We present an effective yet simple framework to extend the use of replays across multiple experiments, minimally adapting the RL workflow for sizeable improvements in controller performance and research iteration times. At its core, Replay across Experiments (RaE) involves reusing experience from previous experiments to improve exploration and bootstrap learning while reducing required changes to a minimum in comparison to prior work. We empirically show benefits across a number of RL algorithms and challenging control domains spanning both locomotion and manipulation, including hard exploration tasks from egocentric vision. Through comprehensive ablations, we demonstrate robustness to the quality and amount of data available and various hyperparameter choices. Finally, we discuss how our approach can be applied more broadly across research life cycles and can increase resilience by reloading data across random seeds or hyperparameter variations.

Details

TMLR Journal 2023 Journal Article

SkillS: Adaptive Skill Sequencing for Efficient Temporally-Extended Exploration

Giulia Vezzani
Dhruva Tirumala
Markus Wulfmeier
Dushyant Rao
Abbas Abdolmaleki
Ben Moran
Tuomas Haarnoja
Jan Humplik

The ability to effectively reuse prior knowledge is a key requirement when building general and flexible Reinforcement Learning (RL) agents. Skill reuse is one of the most common approaches, but current methods have considerable limitations. For example, fine-tuning an existing policy frequently fails, as the policy can degrade rapidly early in training. In a similar vein, distillation of expert behavior can lead to poor results when given sub-optimal experts. We compare several common approaches for skill transfer on multiple domains including changes in task and system dynamics. We identify how existing methods fail and introduce an alternative approach to mitigate these problems. Our approach learns to sequence temporally-extended skills for exploration but learns the final policy directly from the raw experience. This conceptual split enables rapid adaptation and thus efficient data collection but without constraining the final solution. It significantly outperforms many classical methods across a suite of evaluation tasks and we use a broad set of ablations to highlight the importance of different components of our method.

PDF Details

ICLR Conference 2023 Conference Paper

Solving Continuous Control via Q-learning

Tim Seyde
Peter Werner
Wilko Schwarting
Igor Gilitschenski
Martin A. Riedmiller
Daniela Rus
Markus Wulfmeier

While there has been substantial success for solving continuous control with actor-critic methods, simpler critic-only methods such as Q-learning find limited application in the associated high-dimensional action spaces. However, most actor-critic methods come at the cost of added complexity: heuristics for stabilisation, compute requirements and wider hyperparameter search spaces. We show that a simple modification of deep Q-learning largely alleviates these issues. By combining bang-bang action discretization with value decomposition, framing single-agent control as cooperative multi-agent reinforcement learning (MARL), this simple critic-only approach matches performance of state-of-the-art continuous actor-critic methods when learning from features or pixels. We extend classical bandit examples from cooperative MARL to provide intuition for how decoupled critics leverage state information to coordinate joint optimization, and demonstrate surprisingly strong performance across a variety of continuous control tasks.

Details

ICLR Conference 2022 Conference Paper

Learning transferable motor skills with hierarchical latent mixture policies

Dushyant Rao
Fereshteh Sadeghi
Leonard Hasenclever
Markus Wulfmeier
Martina Zambelli
Giulia Vezzani
Dhruva Tirumala
Yusuf Aytar

For robots operating in the real world, it is desirable to learn reusable abstract behaviours that can effectively be transferred across numerous tasks and scenarios. We propose an approach to learn skills from data using a hierarchical mixture latent variable model. Our method exploits a multi-level hierarchy of both discrete and continuous latent variables, to model a discrete set of abstract high-level behaviours while allowing for variance in how they are executed. We demonstrate in manipulation domains that the method can effectively cluster offline data into distinct, executable behaviours, while retaining the flexibility of a continuous latent variable model. The resulting skills can be transferred to new tasks, unseen objects, and from state to vision-based policies, yielding significantly better sample efficiency and asymptotic performance compared to existing skill- and imitation-based methods. We also perform further analysis showing how and when the skills are most beneficial: they encourage directed exploration to cover large regions of the state space relevant to the task, making them most effective in challenging sparse-reward settings.

Details

ICLR Conference 2022 Conference Paper

Wish you were here: Hindsight Goal Selection for long-horizon dexterous manipulation

Todor Davchev
Oleg Sushkov
Jean-Baptiste Regli
Stefan Schaal
Yusuf Aytar
Markus Wulfmeier
Jonathan Scholz

Complex sequential tasks in continuous-control settings often require agents to successfully traverse a set of ``narrow passages'' in their state space. Solving such tasks with a sparse reward in a sample-efficient manner poses a challenge to modern reinforcement learning (RL) due to the associated long-horizon nature of the problem and the lack of sufficient positive signal during learning. Various tools have been applied to address this challenge. When available, large sets of demonstrations can guide agent exploration. Hindsight relabelling on the other hand does not require additional sources of information. However, existing strategies explore based on task-agnostic goal distributions, which can render the solution of long-horizon tasks impractical. In this work, we extend hindsight relabelling mechanisms to guide exploration along task-specific distributions implied by a small set of successful demonstrations. We evaluate the approach on four complex, single and dual arm, robotics manipulation tasks against strong suitable baselines. The method requires far fewer demonstrations to solve all tasks and achieves a significantly higher overall performance as task complexity increases. Finally, we investigate the robustness of the proposed solution with respect to the quality of input representations and the number of demonstrations.

Details

ICML Conference 2021 Conference Paper

Data-efficient Hindsight Off-policy Option Learning

Markus Wulfmeier
Dushyant Rao
Roland Hafner
Thomas Lampe
Abbas Abdolmaleki
Tim Hertweck
Michael Neunert
Dhruva Tirumala

We introduce Hindsight Off-policy Options (HO2), a data-efficient option learning algorithm. Given any trajectory, HO2 infers likely option choices and backpropagates through the dynamic programming inference procedure to robustly train all policy components off-policy and end-to-end. The approach outperforms existing option learning methods on common benchmarks. To better understand the option framework and disentangle benefits from both temporal and action abstraction, we evaluate ablations with flat policies and mixture policies with comparable optimization. The results highlight the importance of both types of abstraction as well as off-policy training and trust-region constraints, particularly in challenging, simulated 3D robot manipulation tasks from raw pixel inputs. Finally, we intuitively adapt the inference step to investigate the effect of increased temporal abstraction on training with pre-trained options and from scratch.

Details

NeurIPS Conference 2021 Conference Paper

Is Bang-Bang Control All You Need? Solving Continuous Control with Bernoulli Policies

Tim Seyde
Igor Gilitschenski
Wilko Schwarting
Bartolomeo Stellato
Martin Riedmiller
Markus Wulfmeier
Daniela Rus

Reinforcement learning (RL) for continuous control typically employs distributions whose support covers the entire action space. In this work, we investigate the colloquially known phenomenon that trained agents often prefer actions at the boundaries of that space. We draw theoretical connections to the emergence of bang-bang behavior in optimal control, and provide extensive empirical evaluation across a variety of recent RL algorithms. We replace the normal Gaussian by a Bernoulli distribution that solely considers the extremes along each action dimension - a bang-bang controller. Surprisingly, this achieves state-of-the-art performance on several continuous control benchmarks - in contrast to robotic hardware, where energy and maintenance cost affect controller choices. Since exploration, learning, and the final solution are entangled in RL, we provide additional imitation learning experiments to reduce the impact of exploration on our analysis. Finally, we show that our observations generalize to environments that aim to model real-world challenges and evaluate factors to mitigate the emergence of bang-bang solutions. Our findings emphasise challenges for benchmarking continuous control algorithms, particularly in light of potential real-world applications.

PDF Details

ICRA Conference 2021 Conference Paper

Representation Matters: Improving Perception and Exploration for Robotics

Markus Wulfmeier
Arunkumar Byravan
Tim Hertweck
Irina Higgins
Ankush Gupta
Tejas Kulkarni
Malcolm Reynolds
Denis Teplyashin

Projecting high-dimensional environment observations into lower-dimensional structured representations can considerably improve data-efficiency for reinforcement learning in domains with limited data such as robotics. Can a single generally useful representation be found? In order to answer this question, it is important to understand how the representation will be used by the agent and what properties such a good representation should have. In this paper we systematically evaluate a number of common learnt and hand-engineered representations in the context of three robotics tasks: lifting, stacking and pushing of 3D blocks. The representations are evaluated in two use-cases: as input to the agent, or as a source of auxiliary tasks. Furthermore, the value of each representation is evaluated in terms of three properties: dimensionality, observability and disentanglement. We can significantly improve performance in both use-cases and demonstrate that some representations can perform commensurate to simulator states as agent inputs. Finally, our results challenge common intuitions by demonstrating that: 1) dimensionality strongly matters for task generation, but is negligible for inputs, 2) observability of task-relevant aspects mostly affects the input representation use-case, and 3) disentanglement leads to better auxiliary tasks, but has only limited benefits for input representations. This work serves as a step towards a more systematic understanding of what makes a good representation for control in robotics, enabling practitioners to make more informed choices for developing new learned or hand-engineered representations.

Details

ICRA Conference 2018 Conference Paper

Incremental Adversarial Domain Adaptation for Continually Changing Environments

Markus Wulfmeier
Alex Bewley
Ingmar Posner

Continuous appearance shifts such as changes in weather and lighting conditions can impact the performance of deployed machine learning models. While unsupervised domain adaptation aims to address this challenge, current approaches do not utilise the continuity of the occurring shifts. In particular, many robotics applications exhibit these conditions and thus facilitate the potential to incrementally adapt a learnt model over minor shifts which integrate to massive differences over time. Our work presents an adversarial approach for lifelong, incremental domain adaptation which benefits from unsupervised alignment to a series of intermediate domains which successively diverge from the labelled source domain. We empirically demonstrate that our incremental approach improves handling of large appearance changes, e. g. day to night, on a traversable-path segmentation task compared with a direct, single alignment step approach. Furthermore, by approximating the feature distribution for the source domain with a generative adversarial network, the deployment module can be rendered fully independent of retaining potentially large amounts of the related source training data for only a minor reduction in performance.

Details

ICML Conference 2018 Conference Paper

TACO: Learning Task Decomposition via Temporal Alignment for Control

Kyriacos Shiarlis
Markus Wulfmeier
Sasha Salter
Shimon Whiteson
Ingmar Posner

Many advanced Learning from Demonstration (LfD) methods consider the decomposition of complex, real-world tasks into simpler sub-tasks. By reusing the corresponding sub-policies within and between tasks, we can provide training data for each policy from different high-level tasks and compose them to perform novel ones. Existing approaches to modular LfD focus either on learning a single high-level task or depend on domain knowledge and temporal segmentation. In contrast, we propose a weakly supervised, domain-agnostic approach based on task sketches, which include only the sequence of sub-tasks performed in each demonstration. Our approach simultaneously aligns the sketches with the observed demonstrations and learns the required sub-policies. This improves generalisation in comparison to separate optimisation procedures. We evaluate the approach on multiple domains, including a simulated 3D robot arm control task using purely image-based observations. The results show that our approach performs commensurately with fully supervised approaches, while requiring significantly less annotation effort.

Details

IROS Conference 2017 Conference Paper

Addressing appearance change in outdoor robotics with adversarial domain adaptation

Markus Wulfmeier
Alex Bewley
Ingmar Posner

Appearance changes due to weather and seasonal conditions represent a strong impediment to the robust implementation of machine learning systems in outdoor robotics. While supervised learning optimises a model for the training domain, it will deliver degraded performance in application domains that underlie distributional shifts caused by these changes. Traditionally, this problem has been addressed via the collection of labelled data in multiple domains or by imposing priors on the type of shift between both domains. We frame the problem in the context of unsupervised domain adaptation and develop a framework for applying adversarial techniques to adapt popular, state-of-the-art network architectures with the additional objective to align features across domains. Moreover, as adversarial training is notoriously unstable, we first perform an extensive ablation study, adapting many techniques known to stabilise generative adversarial networks, and evaluate on a surrogate classification task with the same appearance change. The distilled insights are applied to the problem of free-space segmentation for motion planning in autonomous driving.

Details

IROS Conference 2016 Conference Paper

Watch this: Scalable cost-function learning for path planning in urban environments

Markus Wulfmeier
Dominic Zeng Wang
Ingmar Posner

In this work, we present an approach to learn cost maps for driving in complex urban environments from a large number of demonstrations of human driving behaviour. The learned cost maps are constructed directly from raw sensor measurements, bypassing the effort of manually designing cost maps as well as features. When deploying the cost maps, the trajectories generated not only replicate human-like driving behaviour but are also demonstrably robust against systematic errors in putative robot configuration. To achieve this we deploy a Maximum Entropy based, non-linear IRL framework which uses Fully Convolutional Neural Networks (FCNs) to represent the cost model underlying expert driving behaviour. Using a deep, parametric approach enables us to scale efficiently to large datasets and complex behaviours while being run-time independent of dataset extent during deployment. We demonstrate scalability and performance on an ambitious dataset collected over the course of one year including more than 25k demonstration trajectories extracted from over 120km of driving and 13 different drivers. We evaluate against a carefully designed cost map and, in addition, demonstrate robustness to systematic errors by learning precise cost-maps even in the presence of system calibration perturbations.

Details