Pavel Kolev Papers

EWRL Workshop 2025 Workshop Paper

SENSEI: Semantic Exploration Guided by Foundation Models to Learn Versatile World Models

Cansu Sancaktar
Christian Gumbsch
Andrii Zadaianchuk
Pavel Kolev
Georg Martius

Exploration is a cornerstone of reinforcement learning (RL). Intrinsic motivation attempts to decouple exploration from external, task-based rewards. However, established approaches to intrinsic motivation that follow general principles such as information gain, often only uncover low-level interactions. In contrast, children’s play suggests that they engage in meaningful high-level behavior by imitating or interacting with their caregivers. Recent work has focused on using foundation models to inject these semantic biases into exploration. However, these methods often rely on unrealistic assumptions, such as language-embedded environments or access to high-level actions. We propose SEmaNtically Sensible ExploratIon (SENSEI), a framework to equip model-based RL agents with an intrinsic mo- tivation for semantically meaningful behavior. SENSEI distills a reward signal of interestingness from Vision Language Model (VLM) annotations, enabling an agent to predict these rewards through a world model. Using model-based RL, SENSEI trains an exploration policy that jointly maximizes semantic rewards and uncertainty. We show that in both robotic and video game-like simulations SENSEI discovers a variety of meaningful behaviors from image observations and low-level actions. SENSEI provides a general tool for learning from foundation model feedback, a crucial research direction, as VLMs become more powerful.

PDF

ICML Conference 2025 Conference Paper

SENSEI: Semantic Exploration Guided by Foundation Models to Learn Versatile World Models

Cansu Sancaktar
Christian Gumbsch
Andrii Zadaianchuk
Pavel Kolev
Georg Martius

Exploration is a cornerstone of reinforcement learning (RL). Intrinsic motivation attempts to decouple exploration from external, task-based rewards. However, established approaches to intrinsic motivation that follow general principles such as information gain, often only uncover low-level interactions. In contrast, children’s play suggests that they engage in meaningful high-level behavior by imitating or interacting with their caregivers. Recent work has focused on using foundation models to inject these semantic biases into exploration. However, these methods often rely on unrealistic assumptions, such as language-embedded environments or access to high-level actions. We propose SEmaNtically Sensible ExploratIon (SENSEI), a framework to equip model-based RL agents with an intrinsic motivation for semantically meaningful behavior. SENSEI distills a reward signal of interestingness from Vision Language Model (VLM) annotations, enabling an agent to predict these rewards through a world model. Using model-based RL, SENSEI trains an exploration policy that jointly maximizes semantic rewards and uncertainty. We show that in both robotic and video game-like simulations SENSEI discovers a variety of meaningful behaviors from image observations and low-level actions. SENSEI provides a general tool for learning from foundation model feedback, a crucial research direction, as VLMs become more powerful.

Details

EWRL Workshop 2024 Workshop Paper

Dual-Force: Enhanced Offline Diversity Maximization under Imitation Constraints

Pavel Kolev
Marin Vlastelica
Georg Martius

While many algorithms for diversity maximization under imitation constraints are online in nature, many applications require offline algorithms without environment interactions. Tackling this problem in the offline setting, however, presents significant challenges that require non-trivial, multi-stage optimization processes with non-stationary rewards. In this work, we present a novel offline algorithm that enhances diversity using an objective based on Van der Waals (VdW) force and successor features, and eliminates the need to learn a previously used skill discriminator. Moreover, by conditioning the value function and policy on a pre-trained Functional Reward Encoding (FRE), our method allows for better handling of non-stationary rewards and provides zero-shot recall of all skills encountered during training, significantly expanding the set of skills learned in prior work. Consequently, our algorithm benefits from receiving a consistently strong diversity signal (VdW), and enjoys more stable and efficient training. We demonstrate the effectiveness of our method in generating diverse skills for two robotic tasks in simulation: locomotion of a quadruped and local navigation with obstacle traversal.

PDF

ICRA Conference 2024 Conference Paper

Learning Diverse Skills for Local Navigation under Multi-constraint Optimality

Jin Cheng 0002
Marin Vlastelica
Pavel Kolev
Chenhao Li
Georg Martius

Despite many successful applications of data-driven control in robotics, extracting meaningful diverse behaviors remains a challenge. Typically, task performance needs to be compromised in order to achieve diversity. In many scenarios, task requirements are specified as a multitude of reward terms, each requiring a different trade-off. In this work, we take a constrained optimization viewpoint on the quality-diversity trade-off and show that we can obtain diverse policies while imposing constraints on their value functions which are defined through distinct rewards. In line with previous work, further control of the diversity level can be achieved through an attract-repel reward term motivated by the Van der Waals force. We demonstrate the effectiveness of our method on a local navigation task where a quadruped robot needs to reach the target within a finite horizon. Finally, our trained policies transfer well to the real 12-DoF quadruped robot, Solo12, and exhibit diverse agile behaviors with successful obstacle traversal.

Details

RLC Conference 2024 Conference Paper

Offline Diversity Maximization under Imitation Constraints

Marin Vlastelica
Jin Cheng
Georg Martius
Pavel Kolev

There has been significant recent progress in the area of unsupervised skill discovery, utilizing various information-theoretic objectives as measures of diversity. Despite these advances, challenges remain: current methods require significant online interaction, fail to leverage vast amounts of available task-agnostic data and typically lack a quantitative measure of skill utility. We address these challenges by proposing a principled offline algorithm for unsupervised skill discovery that, in addition to maximizing diversity, ensures that each learned skill imitates state-only expert demonstrations to a certain degree. Our main analytical contribution is to connect Fenchel duality, reinforcement learning, and unsupervised skill discovery to maximize a mutual information objective subject to KL-divergence state occupancy constraints. Furthermore, we demonstrate the effectiveness of our method on the standard offline benchmark D4RL and on a custom offline dataset collected from a 12-DoF quadruped robot for which the policies trained in simulation transfer well to the real robotic system.

PDF Details

RLJ Journal 2024 Journal Article

Offline Diversity Maximization under Imitation Constraints

Marin Vlastelica
Jin Cheng
Georg Martius
Pavel Kolev

There has been significant recent progress in the area of unsupervised skill discovery, utilizing various information-theoretic objectives as measures of diversity. Despite these advances, challenges remain: current methods require significant online interaction, fail to leverage vast amounts of available task-agnostic data and typically lack a quantitative measure of skill utility. We address these challenges by proposing a principled offline algorithm for unsupervised skill discovery that, in addition to maximizing diversity, ensures that each learned skill imitates state-only expert demonstrations to a certain degree. Our main analytical contribution is to connect Fenchel duality, reinforcement learning, and unsupervised skill discovery to maximize a mutual information objective subject to KL-divergence state occupancy constraints. Furthermore, we demonstrate the effectiveness of our method on the standard offline benchmark D4RL and on a custom offline dataset collected from a 12-DoF quadruped robot for which the policies trained in simulation transfer well to the real robotic system.

PDF Details

ICLR Conference 2023 Conference Paper

Benchmarking Offline Reinforcement Learning on Real-Robot Hardware

Nico Gürtler
Sebastian Blaes
Pavel Kolev
Felix Widmaier
Manuel Wüthrich
Stefan Bauer
Bernhard Schölkopf
Georg Martius

Learning policies from previously recorded data is a promising direction for real-world robotics tasks, as online learning is often infeasible. Dexterous manipulation in particular remains an open problem in its general form. The combination of offline reinforcement learning with large diverse datasets, however, has the potential to lead to a breakthrough in this challenging domain analogously to the rapid progress made in supervised learning in recent years. To coordinate the efforts of the research community toward tackling this problem, we propose a benchmark including: i) a large collection of data for offline learning from a dexterous manipulation platform on two tasks, obtained with capable RL agents trained in simulation; ii) the option to execute learned policies on a real-world robotic system and a simulation for efficient debugging. We evaluate prominent open-sourced offline reinforcement learning algorithms on the datasets and provide a reproducible experimental setup for offline reinforcement learning on real systems.

Details

EWRL Workshop 2023 Workshop Paper

Diverse Offline Imitation via Fenchel Duality

Marin Vlastelica
Pavel Kolev
Jin Cheng
Georg Martius

There has been significant recent progress in the area of unsupervised skill discovery, with various works proposing mutual information based objectives, as a source of intrinsic motivation. Prior works predominantly focused on designing algorithms that require online access to the environment. In contrast, we develop an offline skill discovery algorithm. Our problem formulation considers the maximization of a mutual information objective constrained by a KL-divergence. More precisely, the constraints ensure that the state occupancy of each skill remains close to the state occupancy of an expert, within the support of an offline dataset with good state-action coverage. Our main contribution is to connect Fenchel-Rockafellar duality, reinforcement learning and unsupervised skill discovery, and to give a simple offline algorithm for learning diverse skills that are aligned with an expert.

PDF

EWRL Workshop 2023 Workshop Paper

On Imitation in Mean-field Games

Giorgia Ramponi
Pavel Kolev
Olivier Pietquin
Niao He
Mathieu Lauriere
Matthieu Geist

We explore the problem of imitation learning (IL) in the context of mean-field games (MFGs), where the goal is to imitate the behavior of a population of agents following a Nash equilibrium policy according to some unknown payoff function. IL in MFGs presents new challenges compared to single-agent IL, particularly when both the reward function and the transition kernel depend on the population distribution. In this paper, departing from the existing literature on IL for MFGs, we introduce a new solution concept called the Nash imitation gap. Then we show that when only the reward depends on the population distribution, IL in MFGs can be reduced to single-agent IL with similar guarantees. However, when the dynamics is population-dependent, we provide a novel upper-bound that suggests IL is harder in this setting. To address this issue, we propose a new adversarial formulation where the reinforcement learning problem is replaced by a mean-field control (MFC) problem, suggesting progress in IL within MFGs may have to build upon MFC.

PDF

NeurIPS Conference 2023 Conference Paper

On Imitation in Mean-field Games

Giorgia Ramponi
Pavel Kolev
Olivier Pietquin
Niao He
Mathieu Lauriere
Matthieu Geist

We explore the problem of imitation learning (IL) in the context of mean-field games (MFGs), where the goal is to imitate the behavior of a population of agents following a Nash equilibrium policy according to some unknown payoff function. IL in MFGs presents new challenges compared to single-agent IL, particularly when both the reward function and the transition kernel depend on the population distribution. In this paper, departing from the existing literature on IL for MFGs, we introduce a new solution concept called the Nash imitation gap. Then we show that when only the reward depends on the population distribution, IL in MFGs can be reduced to single-agent IL with similar guarantees. However, when the dynamics is population-dependent, we provide a novel upper-bound that suggests IL is harder in this setting. To address this issue, we propose a new adversarial formulation where the reinforcement learning problem is replaced by a mean-field control (MFC) problem, suggesting progress in IL within MFGs may have to build upon MFC.

PDF Details

NeurIPS Conference 2023 Conference Paper

Online Learning under Adversarial Nonlinear Constraints

Pavel Kolev
Georg Martius
Michael Muehlebach

In many applications, learning systems are required to process continuous non-stationary data streams. We study this problem in an online learning framework and propose an algorithm that can deal with adversarial time-varying and nonlinear constraints. As we show in our work, the algorithm called Constraint Violation Velocity Projection (CVV-Pro) achieves $\sqrt{T}$ regret and converges to the feasible set at a rate of $1/\sqrt{T}$, despite the fact that the feasible set is slowly time-varying and a priori unknown to the learner. CVV-Pro only relies on local sparse linear approximations of the feasible set and therefore avoids optimizing over the entire set at each iteration, which is in sharp contrast to projected gradients or Frank-Wolfe methods. We also empirically evaluate our algorithm on two-player games, where the players are subjected to a shared constraint.

PDF Details

EWRL Workshop 2023 Workshop Paper

Online Learning under Adversarial Nonlinear Constraints

Pavel Kolev
Georg Martius
Michael Muehlebach

In many applications, learning systems are required to process continuous non-stationary data streams. We study this problem in an online learning framework and propose an algorithm that can deal with adversarial time-varying and nonlinear constraints. As we show in our work, the algorithm called Constraint Violation Velocity Projection (CVV-Pro) achieves $\sqrt{T}$ regret and converges to the feasible set at a rate of $1/\sqrt{T}$, despite the fact that the feasible set is slowly time-varying and a priori unknown to the learner. CVV-Pro only relies on local sparse linear approximations of the feasible set and therefore avoids optimizing over the entire set at each iteration, which is in sharp contrast to projected gradients or Frank-Wolfe methods. We also empirically evaluate our algorithm on two-player games, where the players are subjected to a shared constraint.

PDF

ICRA Conference 2023 Conference Paper

Versatile Skill Control via Self-supervised Adversarial Imitation of Unlabeled Mixed Motions

Chenhao Li
Sebastian Blaes
Pavel Kolev
Marin Vlastelica
Jonas Frey
Georg Martius

Learning diverse skills is one of the main challenges in robotics. To this end, imitation learning approaches have achieved impressive results. These methods require explicitly labeled datasets or assume consistent skill execution to enable learning and active control of individual behaviors, which limits their applicability. In this work, we propose a cooperative adversarial method for obtaining single versatile policies with controllable skill sets from unlabeled datasets containing diverse state transition patterns by maximizing their discriminability. Moreover, we show that by utilizing unsupervised skill discovery in the generative adversarial imitation learning framework, novel and useful skills emerge with successful task fulfillment. Finally, the obtained versatile policies are tested on an agile quadruped robot called Solo 8 and present faithful replications of diverse skills encoded in the demonstrations.

Details

TCS Journal 2022 Journal Article

Physarum-inspired multi-commodity flow dynamics

Vincenzo Bonifaci
Enrico Facca
Frederic Folz
Andreas Karrenbauer
Pavel Kolev
Kurt Mehlhorn
Giovanna Morigi
Golnoosh Shahkarami

Details DOI

TCS Journal 2020 Journal Article

Convergence of the non-uniform directed Physarum model

Enrico Facca
Andreas Karrenbauer
Pavel Kolev
Kurt Mehlhorn

Details DOI

TCS Journal 2020 Journal Article

Convergence of the non-uniform Physarum dynamics

Andreas Karrenbauer
Pavel Kolev
Kurt Mehlhorn

Details DOI

NeurIPS Conference 2020 Conference Paper

Secretary and Online Matching Problems with Machine Learned Advice

Antonios Antoniadis
Themis Gouleakis
Pieter Kleer
Pavel Kolev

The classical analysis of online algorithms, due to its worst-case nature, can be quite pessimistic when the input instance at hand is far from worst-case. Often this is not an issue with machine learning approaches, which shine in exploiting patterns in past inputs in order to predict the future. However, such predictions, although usually accurate, can be arbitrarily poor. Inspired by a recent line of work, we augment three well-known online settings with machine learned predictions about the future, and develop algorithms that take them into account. In particular, we study the following online selection problems: (i) the classical secretary problem, (ii) online bipartite matching and (iii) the graphic matroid secretary problem. Our algorithms still come with a worst-case performance guarantee in the case that predictions are subpar while obtaining an improved competitive ratio (over the best-known classical online algorithm for each problem) when the predictions are sufficiently accurate. For each algorithm, we establish a trade-off between the competitive ratios obtained in the two respective cases.

PDF Details

SODA Conference 2019 Conference Paper

A PTAS for ℓp-Low Rank Approximation

Frank Ban
Vijay Bhattiprolu
Karl Bringmann
Pavel Kolev
Euiwoong Lee
David P. Woodruff

A number of recent works have studied algorithms for entrywise ℓ p -low rank approximation, namely algorithms which given an n × d matrix A (with n ≥ d ), output a rank- k matrix B minimizing ‖ A – B ‖ p p = ∑ i, j |A i, j – B i, j | p when p > 0; and ‖ A – B ‖0 = ∑ i, j [ A i, j ≠ B i, j ] for p = 0, where [·] is the Iverson bracket, that is, ‖ A – B ‖ 0 denotes the number of entries ( i, j ) for which A i, j ≠ B i, j. For p = 1, this is often considered more robust than the SVD, while for p = 0 this corresponds to minimizing the number of disagreements, or robust PCA. This problem is known to be NP-hard for p ∊ {0, 1}, already for k = 1, and while there are polynomial time approximation algorithms, their approximation factor is at best poly( k ). It was left open if there was a polynomial-time approximation scheme (PTAS) for ℓ p -approximation for any p ≥ 0. We show the following: 1. On the algorithmic side, for p ∊ (0, 2), we give the first n poly( k / ε ) time (1 + ε )-approximation algorithm. For p = 0, there are various problem formulations, a common one being the binary setting in which A ∊ {0, 1} n × d and B = U · V, where U ∊ {0, 1} n × k and V ∊ {0, 1} k × d. There are also various notions of multiplication U · V, such as a matrix product over the reals, over a finite field, or over a Boolean semiring. We give the first almost-linear time approximation scheme for what we call the Generalized Binary ℓ 0 - Rank-k problem, for which these variants are special cases. Our algorithm computes (1 + ε )-approximation in time (1/ ε ) 2 O ( k ) / ε 2 · nd 1 + o (1), where o (1) hides a factor (log log d ) 1. 1 / log d. In addition, for the case of finite fields of constant size, we obtain an alternate PTAS running in time n · d poly( k / ε ). 2. On the hardness front, for p ∊ (1, 2), we show under the Small Set Expansion Hypothesis and Exponential Time Hypothesis (ETH), there is no constant factor approximation algorithm running in time 2 kδ for a constant δ > 0, showing an exponential dependence on k is necessary. For p = 0, we observe that there is no approximation algorithm for the Generalized Binary ℓ 0 -Rank- k problem running in time 2 2 δk for a constant δ > 0. We also show for finite fields of constant size, under the ETH, that any fixed constant factor approximation algorithm requires 2 kδ time for a constant δ > 0.

Details

TCS Journal 2019 Journal Article

Two results on slime mold computations

Ruben Becker
Vincenzo Bonifaci
Andreas Karrenbauer
Pavel Kolev
Kurt Mehlhorn

Details DOI

NeurIPS Conference 2017 Conference Paper

Approximation Algorithms for $\ell_0$-Low Rank Approximation

Karl Bringmann
Pavel Kolev
David Woodruff

We study the $\ell_0$-Low Rank Approximation Problem, where the goal is, given an $m \times n$ matrix $A$, to output a rank-$k$ matrix $A'$ for which $\|A'-A\|_0$ is minimized. Here, for a matrix $B$, $\|B\|_0$ denotes the number of its non-zero entries. This NP-hard variant of low rank approximation is natural for problems with no underlying metric, and its goal is to minimize the number of disagreeing data positions. We provide approximation algorithms which significantly improve the running time and approximation factor of previous work. For $k > 1$, we show how to find, in poly$(mn)$ time for every $k$, a rank $O(k \log(n/k))$ matrix $A'$ for which $\|A'-A\|_0 \leq O(k^2 \log(n/k)) \OPT$. To the best of our knowledge, this is the first algorithm with provable guarantees for the $\ell_0$-Low Rank Approximation Problem for $k > 1$, even for bicriteria algorithms. For the well-studied case when $k = 1$, we give a $(2+\epsilon)$-approximation in {\it sublinear time}, which is impossible for other variants of low rank approximation such as for the Frobenius norm. We strengthen this for the well-studied case of binary matrices to obtain a $(1+O(\psi))$-approximation in sublinear time, where $\psi = \OPT/\nnz{A}$. For small $\psi$, our approximation factor is $1+o(1)$.

PDF Details

Possible papers