Arrow Research

Author name cluster

Simone Parisi

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

11 papers
2 author rows

Possible papers (11)

NeurIPS 2024 Conference Paper

Beyond Optimism: Exploration With Partially Observable Rewards

  • Simone Parisi
  • Alireza Kazemipour
  • Michael Bowling

Exploration in reinforcement learning (RL) remains an open challenge. RL algorithms rely on observing rewards to train the agent, and if informative rewards are sparse the agent learns slowly or may not learn at all. To improve exploration and reward discovery, popular algorithms rely on optimism. But what if rewards are sometimes unobservable, e.g., in situations of partial monitoring in bandits and the recent formalism of monitored Markov decision processes? In this case, optimism can lead to suboptimal behavior in which the agent stops exploring and never collapses its uncertainty. In this paper, we present a novel exploration strategy that overcomes the limitations of existing methods and guarantees convergence to an optimal policy even when rewards are not always observable. We further propose a collection of tabular environments for benchmarking exploration in RL (with and without unobservable rewards) and show that our method outperforms existing ones.
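
The abstract does not spell out the method itself, so the following is only a toy sketch of the failure mode and one workaround: an agent whose exploration bonus comes from visit counts rather than reward optimism keeps exploring even when rewards are masked. The chain environment, masking probability, and bonus are all hypothetical, not the paper's algorithm.

```python
# Toy sketch (NOT the paper's algorithm): count-based exploration in a chain
# MDP whose goal reward is only observable 20% of the time. Because the bonus
# does not depend on observed rewards, the agent keeps exploring regardless.
import numpy as np

rng = np.random.default_rng(0)
n_states, gamma, alpha = 8, 0.95, 0.1
Q = np.zeros((n_states, 2))              # action 0 = left, action 1 = right
N = np.ones((n_states, 2))               # visit counts driving the bonus

def step(s, a):
    s2 = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    r = 1.0 if s2 == n_states - 1 else 0.0
    observed = (rng.random() < 0.2) if r > 0 else True   # goal reward often masked
    return s2, (r if observed else None)

s = 0
for _ in range(5000):
    a = int(np.argmax(Q[s] + 1.0 / np.sqrt(N[s])))       # value + count bonus
    s2, r = step(s, a)
    N[s, a] += 1
    if r is not None:                                     # update only when observed
        Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])
    s = 0 if s2 == n_states - 1 else s2

print(Q.round(2))                                         # inspect learned values
```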

AAMAS 2024 Conference Paper

Monitored Markov Decision Processes

  • Simone Parisi
  • Montaser Mohammedalamen
  • Alireza Kazemipour
  • Matthew E. Taylor
  • Michael Bowling

In reinforcement learning (RL), an agent learns to perform a task by interacting with an environment and receiving feedback (a numerical reward) for its actions. However, the assumption that rewards are always observable is often not applicable in real-world problems. For example, the agent may need to ask a human to supervise its actions or activate a monitoring system to receive feedback. There may even be a period of time before rewards become observable, or a period of time after which rewards are no longer given. In other words, there are cases where the environment generates rewards in response to the agent’s actions but the agent cannot observe them. In this paper, we formalize a novel but general RL framework — Monitored MDPs — where the agent cannot always observe rewards. We discuss the theoretical and practical consequences of this setting, show challenges raised even in toy environments, and propose algorithms to begin to tackle this novel setting. This paper introduces a powerful new formalism that encompasses both new and existing problems and lays the foundation for future research.
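
Concretely, the setting can be pictured as a thin wrapper in which the reward is always generated but only sometimes delivered. The sketch below is a minimal illustration of that interface; the class name and the Bernoulli monitor are my own assumptions, not the paper's API.

```python
# Minimal sketch of a Monitored-MDP-style interface (hypothetical names, not
# the paper's API): the environment always generates a reward, but a monitor
# decides whether the agent observes it.
import random
from typing import Optional, Tuple

class MonitoredEnv:
    def __init__(self, env, observe_prob: float = 0.5):
        self.env = env                    # anything with step(action) -> (obs, reward, done)
        self.observe_prob = observe_prob  # i.i.d. Bernoulli monitor, for illustration only

    def step(self, action) -> Tuple[object, Optional[float], bool]:
        obs, reward, done = self.env.step(action)
        seen = random.random() < self.observe_prob
        return obs, (reward if seen else None), done   # None = generated but unobserved
```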

ICML 2022 Conference Paper

The Unsurprising Effectiveness of Pre-Trained Vision Models for Control

  • Simone Parisi
  • Aravind Rajeswaran
  • Senthil Purushwalkam
  • Abhinav Gupta 0001

Recent years have seen the emergence of pre-trained representations as a powerful abstraction for AI applications in computer vision, natural language, and speech. However, policy learning for control is still dominated by a tabula-rasa learning paradigm, with visuo-motor policies often trained from scratch using data from deployment environments. In this context, we revisit and study the role of pre-trained visual representations for control, and in particular representations trained on large-scale computer vision datasets. Through extensive empirical evaluation in diverse control domains (Habitat, DeepMind Control, Adroit, Franka Kitchen), we isolate and study the importance of different representation training methods, data augmentations, and feature hierarchies. Overall, we find that pre-trained visual representations can be competitive or even better than ground-truth state representations to train control policies. This is in spite of using only out-of-domain data from standard vision datasets, without any in-domain data from the deployment environments.
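
As a rough illustration of the recipe, assuming PyTorch/torchvision: freeze an ImageNet-pre-trained encoder and train only a small policy head on its features. The encoder choice, feature dimension, and action count are placeholders, not the paper's exact setup.

```python
# Sketch of frozen pre-trained features for control (assumes torch/torchvision;
# the ResNet-18 encoder and 4-action head are illustrative choices).
import torch
import torch.nn as nn
import torchvision

encoder = torchvision.models.resnet18(weights=torchvision.models.ResNet18_Weights.DEFAULT)
encoder.fc = nn.Identity()            # drop the classifier, keep 512-d features
encoder.eval()
for p in encoder.parameters():
    p.requires_grad = False           # the representation stays frozen

policy_head = nn.Linear(512, 4)       # only this small head is trained

def act(frames: torch.Tensor) -> torch.Tensor:
    """frames: (B, 3, 224, 224) image batch -> action logits."""
    with torch.no_grad():
        feats = encoder(frames)       # out-of-domain visual features
    return policy_head(feats)

logits = act(torch.rand(1, 3, 224, 224))
```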

NeurIPS 2021 Conference Paper

Interesting Object, Curious Agent: Learning Task-Agnostic Exploration

  • Simone Parisi
  • Victoria Dean
  • Deepak Pathak
  • Abhinav Gupta

Common approaches for task-agnostic exploration learn tabula rasa: the agent assumes isolated environments and no prior knowledge or experience. However, in the real world, agents learn in many environments and always carry prior experience as they explore new ones. Exploration is a lifelong process. In this paper, we propose a paradigm change in the formulation and evaluation of task-agnostic exploration. In this setup, the agent first learns to explore across many environments without any extrinsic goal in a task-agnostic manner. Later on, the agent effectively transfers the learned exploration policy to better explore new environments when solving tasks. In this context, we evaluate several baseline exploration strategies and present a simple yet effective approach to learning task-agnostic exploration policies. Our key idea is that there are two components of exploration: (1) an agent-centric component encouraging exploration of unseen parts of the environment based on an agent's belief; (2) an environment-centric component encouraging exploration of inherently interesting objects. We show that our formulation is effective and provides the most consistent exploration across several training-testing environment pairs. We also introduce benchmarks and metrics for evaluating task-agnostic exploration strategies. The source code is available at https://github.com/sparisi/cbet/.
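
The two-component idea can be paraphrased as a count-based intrinsic reward; the sketch below is my reading of the abstract, and the exact C-BET formula combines the counts differently.

```python
# Sketch of the two-component intrinsic reward (a paraphrase of the abstract,
# not the exact C-BET formula): agent-centric novelty from state-visitation
# counts, environment-centric interest from counts of observed "changes".
from collections import defaultdict

state_counts = defaultdict(int)    # agent-centric: the agent's belief of novelty
change_counts = defaultdict(int)   # environment-centric: e.g. object interactions

def intrinsic_reward(next_state, change=None) -> float:
    state_counts[next_state] += 1
    agent_term = 1.0 / state_counts[next_state] ** 0.5
    env_term = 0.0
    if change is not None:
        change_counts[change] += 1
        env_term = 1.0 / change_counts[change] ** 0.5
    return agent_term + env_term   # favors unseen states and interesting objects
```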

EWRL 2018 Workshop Paper

TD-Regularized Actor-Critic Methods

  • Simone Parisi
  • Voot Tangkaratt
  • Jan Peters
  • Mohammad Khan

Actor-critic methods can achieve incredible performance on difficult reinforcement-learning problems, but they are also prone to instability due to the interplay between the actor and critic during learning. To improve their stability, we propose a novel TD-regularized actor-critic method. Our method regularizes the learning objective of the actor by penalizing the temporal-difference error of the critic. This improves stability by avoiding overconfident steps in the actor update when the critic is highly inaccurate. We show that our TD-regularization can be easily applied to existing actor-critic methods, e.g., deterministic policy gradient and trust-region policy optimization, with only a slight increase in computation. Evaluations on standard benchmarks show that our method improves stability and exhibits better performance and data-efficiency than its non-regularized counterparts.
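
The abstract describes the regularizer precisely enough to sketch: penalize the actor's objective with the critic's squared TD error. Below is a minimal one-step actor-critic sketch assuming PyTorch; the networks, eta, and the score-function estimate of the penalty are illustrative simplifications, not the paper's exact formulation.

```python
# Minimal sketch of TD-regularized actor-critic (simplified one-step version;
# networks and eta are illustrative). The actor's loss adds a penalty that
# grows with the critic's TD error, discouraging overconfident actor steps.
import torch
import torch.nn as nn

actor = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 2))   # action logits
critic = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 1))  # V(s)
opt = torch.optim.Adam([*actor.parameters(), *critic.parameters()], lr=3e-4)
gamma, eta = 0.99, 0.1   # eta weighs the TD penalty on the actor

def update(s, a, r, s2):
    td_err = r + gamma * critic(s2).detach() - critic(s)   # critic's TD error
    critic_loss = td_err.pow(2).mean()
    logp = torch.log_softmax(actor(s), dim=-1).gather(1, a)
    pg_loss = -(logp * td_err.detach()).mean()             # vanilla policy gradient
    # Score-function estimate of E_pi[delta^2]: steers the policy away from
    # actions where the critic is still inaccurate.
    td_penalty = (logp * td_err.detach().pow(2)).mean()
    opt.zero_grad()
    (pg_loss + eta * td_penalty + critic_loss).backward()
    opt.step()

update(torch.rand(8, 4), torch.randint(0, 2, (8, 1)), torch.rand(8, 1), torch.rand(8, 4))
```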

IROS 2017 Conference Paper

Goal-driven dimensionality reduction for reinforcement learning

  • Simone Parisi
  • Simon Ramstedt
  • Jan Peters 0001

Defining a state representation on which optimal control can perform well is a tedious but crucial process. It typically requires expert knowledge, does not generalize straightforwardly over different tasks and strongly influences the quality of the learned controller. In this paper, we present an autonomous feature construction method for learning low-dimensional manifolds of goal-relevant features jointly with an optimal controller using reinforcement learning. Our method combines information-theoretic algorithms with principal component analysis to perform a return-weighted reduction of the state representation. The method does not require any preprocessing of the data, does not assume strong restrictions on the state representation, and substantially improves the performance of learning by reducing the number of samples required. We show that our method can learn high-quality controllers in redundant spaces, even from pixels, and outperforms both classical and state-of-the-art deep learning approaches.
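
One way to picture a "return-weighted reduction" is PCA over a covariance weighted by trajectory returns. This is my interpretation of the abstract, not the paper's exact algorithm; the softmax weighting is an assumption.

```python
# Sketch of a return-weighted PCA-style reduction (an interpretation of the
# abstract, not the paper's exact algorithm): weight state samples by the
# return of their trajectory before extracting principal directions.
import numpy as np

def return_weighted_pca(states, returns, k):
    """states: (N, D) visited states; returns: (N,) return credited to each sample."""
    w = np.exp(returns - returns.max())          # soft weights favoring high return
    w /= w.sum()
    mu = w @ states                              # weighted mean
    X = states - mu
    cov = (X * w[:, None]).T @ X                 # return-weighted covariance
    eigval, eigvec = np.linalg.eigh(cov)
    W = eigvec[:, np.argsort(eigval)[::-1][:k]]  # top-k goal-relevant directions
    return X @ W, W                              # low-dimensional features + projection

rng = np.random.default_rng(0)
Z, W = return_weighted_pca(rng.normal(size=(500, 20)), rng.normal(size=500), k=3)
```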

AAAI 2017 Conference Paper

Policy Search with High-Dimensional Context Variables

  • Voot Tangkaratt
  • Herke van Hoof
  • Simone Parisi
  • Gerhard Neumann
  • Jan Peters
  • Masashi Sugiyama

Direct contextual policy search methods learn to improve policy parameters and simultaneously generalize these parameters to different context or task variables. However, learning from high-dimensional context variables, such as camera images, is still a prominent problem in many real-world tasks. A naive application of unsupervised dimensionality reduction methods to the context variables, such as principal component analysis, is insufficient as task-relevant input may be ignored. In this paper, we propose a contextual policy search method in the model-based relative entropy stochastic search framework with integrated dimensionality reduction. We learn a model of the reward that is locally quadratic in both the policy parameters and the context variables. Furthermore, we perform supervised linear dimensionality reduction on the context variables by nuclear norm regularization. The experimental results show that the proposed method outperforms naive dimensionality reduction via principal component analysis and a state-of-the-art contextual policy search method.
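
The low-rank effect of the nuclear norm can be sketched on a reward model that is quadratic in the context alone (the paper's model is quadratic in both policy parameters and context); proximal gradient with singular-value thresholding does the work. Names and step sizes below are illustrative.

```python
# Sketch of nuclear-norm-regularized fitting (simplified: reward quadratic in
# the context only, whereas the paper's model is quadratic in both policy
# parameters and context). A low-rank W is itself a linear reduction of c.
import numpy as np

def svt(M, tau):
    """Proximal operator of the nuclear norm: shrink singular values by tau."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def fit_quadratic_low_rank(C, y, lam=0.1, lr=1e-3, iters=300):
    """C: (N, D) contexts; y: (N,) rewards; fit y ~ c^T W c with low-rank W."""
    N, D = C.shape
    W = np.zeros((D, D))
    for _ in range(iters):
        pred = np.einsum('nd,de,ne->n', C, W, C)          # c_i^T W c_i
        grad = np.einsum('n,nd,ne->de', pred - y, C, C) * (2.0 / N)
        W = svt(W - lr * grad, lr * lam)                  # proximal gradient step
    return W

rng = np.random.default_rng(0)
C = rng.normal(size=(200, 10))
W = fit_quadratic_low_rank(C, C[:, 0] * C[:, 1])          # true signal is rank-2
```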

RLDM 2017 Conference Abstract

Regularized Contextual Policy Search via Mutual Information

  • Simone Parisi
  • Voot Tangkaratt
  • Jan Peters

Contextual policy search algorithms are black-box optimizers that learn to improve policy parameters and simultaneously generalize these parameters to different context or task variables. However, defining a context representation on which policy search can perform well is a tedious but crucial process. It typically requires expert knowledge, does not generalize straightforwardly over different tasks and strongly influences the quality of the learned policy. Furthermore, existing algorithms usually perform dimensionality reduction taking into account only feature redundancy and relevance, ignoring the problem of feature interaction. In this paper, we present an autonomous feature construction algorithm for learning low-dimensional manifolds of goal-relevant features jointly with an optimal policy. We learn a model of the reward that is locally quadratic in both the policy parameters and the context variables. To tackle high-dimensional context variables and to take into account feature interaction, we propose to regularize the model by mutual information.
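
How the mutual-information regularizer enters the objective is not specified in this abstract; as one concrete possibility, a joint-Gaussian MI estimate between two feature blocks could serve as the penalty. The estimator below is my illustration, not the paper's.

```python
# Sketch of a Gaussian mutual-information term of the kind the abstract hints
# at (my illustration; the paper's estimator and objective are not given here).
import numpy as np

def gaussian_mi(X, Y, eps=1e-6):
    """MI between feature blocks X: (N, dx) and Y: (N, dy), joint-Gaussian assumption."""
    def logdet(M):
        M = np.atleast_2d(M)
        return np.linalg.slogdet(M + eps * np.eye(len(M)))[1]
    joint = np.cov(np.hstack([X, Y]), rowvar=False)
    return 0.5 * (logdet(np.cov(X, rowvar=False)) + logdet(np.cov(Y, rowvar=False))
                  - logdet(joint))

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
print(gaussian_mi(X, X[:, :1] + 0.1 * rng.normal(size=(1000, 1))))   # high MI
print(gaussian_mi(X, rng.normal(size=(1000, 2))))                    # near zero
```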

JAIR 2016 Journal Article

Multi-objective Reinforcement Learning through Continuous Pareto Manifold Approximation

  • Simone Parisi
  • Matteo Pirotta
  • Marcello Restelli

Many real-world control applications, from economics to robotics, are characterized by the presence of multiple conflicting objectives. In these problems, the standard concept of optimality is replaced by Pareto-optimality and the goal is to find the Pareto frontier, a set of solutions representing different compromises among the objectives. Despite recent advances in multi-objective optimization, achieving an accurate representation of the Pareto frontier is still an important challenge. In this paper, we propose a reinforcement learning policy gradient approach to learn a continuous approximation of the Pareto frontier in multi-objective Markov Decision Problems (MOMDPs). Unlike previous policy gradient algorithms, where n optimization routines are executed to obtain n solutions, our approach performs a single gradient-ascent run, generating at each step an improved continuous approximation of the Pareto frontier. The idea is to optimize the parameters of a function defining a manifold in the policy-parameter space, so that the corresponding image in the objective space gets as close as possible to the true Pareto frontier. Besides deriving how to compute and estimate such a gradient, we also discuss the non-trivial issue of defining a metric to assess the quality of candidate Pareto frontiers. Finally, the properties of the proposed approach are empirically evaluated on two problems, a linear-quadratic Gaussian regulator and a water reservoir control task.
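
A toy version of the single-run idea: parameterize a curve theta(t) in policy-parameter space and ascend all points along it at once. The per-point scalarization below stands in for the paper's frontier-quality metric, and the two quadratic objectives are illustrative.

```python
# Toy sketch of the manifold idea (the frontier metric here is a naive
# per-point scalarization; the paper derives a proper one). Two toy objectives
# J1 = -(theta-1)^2 and J2 = -(theta+1)^2 have Pareto set theta in [-1, 1].
import numpy as np

rng = np.random.default_rng(0)
phi = np.array([0.0, 0.1])                  # curve theta(t) = phi[0] + t * phi[1]
for _ in range(2000):
    t = rng.random(64)                      # sample points along the manifold
    theta = phi[0] + t * phi[1]
    # d/dtheta of the scalarized objective t*J1 + (1-t)*J2 at each sample
    dJ = t * (-2 * (theta - 1.0)) + (1 - t) * (-2 * (theta + 1.0))
    grad = np.array([dJ.mean(), (dJ * t).mean()])   # chain rule through phi
    phi += 0.01 * grad                      # a single ascent run improves the
                                            # whole frontier approximation at once
print(phi)                                  # ~ [-1, 2]: theta(t) sweeps [-1, 1]
```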

AAAI 2015 Conference Paper

Multi-Objective Reinforcement Learning with Continuous Pareto Frontier Approximation

  • Matteo Pirotta
  • Simone Parisi
  • Marcello Restelli

This paper is about learning a continuous approximation of the Pareto frontier in Multi-Objective Markov Decision Problems (MOMDPs). We propose a policy-based approach that exploits gradient information to generate solutions close to the Pareto ones. Unlike previous policy-gradient multi-objective algorithms, where n optimization routines are used to obtain n solutions, our approach performs a single gradient-ascent run that at each step generates an improved continuous approximation of the Pareto frontier. The idea is to exploit a gradient-based approach to optimize the parameters of a function that defines a manifold in the policy-parameter space so that the corresponding image in the objective space gets as close as possible to the Pareto frontier. Besides deriving how to compute and estimate such a gradient, we also discuss the non-trivial issue of defining a metric to assess the quality of candidate Pareto frontiers. Finally, the properties of the proposed approach are empirically evaluated on two interesting MOMDPs.

IROS 2015 Conference Paper

Reinforcement learning vs human programming in tetherball robot games

  • Simone Parisi
  • Hany Abdulsamad
  • Alexandros Paraschos
  • Christian G. Daniel
  • Jan Peters 0001

Reinforcement learning of motor skills is an important challenge on the way to endowing robots with the ability to learn a wide range of skills and solve complex tasks. However, comparing reinforcement learning against human programming is not straightforward. In this paper, we create a motor learning framework consisting of state-of-the-art components in motor skill learning and compare it to a manually designed program on the task of robot tetherball. We use dynamical motor primitives to represent the robot's trajectories and relative entropy policy search to train the motor framework and improve its behavior by trial and error. These algorithmic components allow for high-quality skill learning, while the experimental setup enables an accurate evaluation of our framework, as robot players can compete against each other. In the complex game of robot tetherball, we show that our learning approach outperforms and wins a match against a high-quality hand-crafted system.
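
The policy-improvement step named in the abstract, relative entropy policy search (REPS), can be sketched in its generic episodic form; the DMP trajectory representation is omitted, and epsilon and the dual solver are standard choices, not specifics from the paper.

```python
# Sketch of the episodic REPS reweighting step (generic form; the paper's
# robot setup layers dynamical motor primitives on top). Returns become sample
# weights under a KL bound epsilon, via the standard REPS dual.
import numpy as np
from scipy.optimize import minimize_scalar

def reps_weights(R, epsilon=0.5):
    """R: (N,) episode returns -> normalized weights for refitting the policy."""
    R = R - R.max()                                   # numerical stability
    def dual(eta):                                    # REPS dual g(eta)
        return eta * epsilon + eta * np.log(np.mean(np.exp(R / eta)))
    eta = minimize_scalar(dual, bounds=(1e-6, 1e6), method='bounded').x
    w = np.exp(R / eta)
    return w / w.sum()

rng = np.random.default_rng(0)
w = reps_weights(rng.normal(size=100))   # then refit the policy by weighted MLE
```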