EWRL Workshop 2025 Workshop Paper
Barycenter Policy Design for Multiple Policy Evaluation
- Simon Weissmann
- Till Freihaut
- Claire Vernade
- Giorgia Ramponi
- Leif Döring
A growing challenge in reinforcement learning is to explore the action space efficiently in order to evaluate multiple target policies using importance sampling. When target policies share similarities, leveraging these resemblances in the behavior policy is crucial for sample efficiency. However, formally defining and algorithmically exploiting such similarities remains an open problem. This article introduces a behavior policy design and examines how different criteria for selecting a behavior policy influence the properties of the importance sampling estimator. We evaluate the resulting behavior policies in downstream tasks, particularly in best policy selection problems. Additionally, we demonstrate that effectively leveraging similarities among target policies leads to a more nuanced behavior policy design and improves regret bounds for best policy selection. To facilitate rigorous analysis, the article is formulated within the stochastic bandit framework.
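To make the setting concrete, the following minimal sketch illustrates importance sampling for several target policies in a stochastic bandit, where all samples are drawn from a single shared behavior policy. It is an illustration under stated assumptions and not the design proposed in the paper: the behavior policy here is simply the arithmetic mean of the target policies, rewards are assumed Gaussian with unit variance, and the function names `mixture_policy` and `is_estimates` are hypothetical.

```python
import random


def mixture_policy(targets):
    """Arithmetic mean of the target policies.

    Assumption for illustration only; this is not the paper's behavior
    policy design, just a simple policy that covers every target.
    """
    n_arms = len(targets[0])
    return [sum(pi[a] for pi in targets) / len(targets) for a in range(n_arms)]


def is_estimates(targets, behavior, mean_rewards, n_samples, seed=0):
    """Importance sampling estimates of each target policy's value.

    Every sample is drawn from the single behavior policy and reweighted
    by pi(a) / behavior(a) for each target policy pi.
    """
    rng = random.Random(seed)
    arms = list(range(len(behavior)))
    totals = [0.0] * len(targets)
    for _ in range(n_samples):
        a = rng.choices(arms, weights=behavior)[0]
        # Assumed reward model: Gaussian noise with unit variance.
        r = rng.gauss(mean_rewards[a], 1.0)
        for i, pi in enumerate(targets):
            totals[i] += pi[a] / behavior[a] * r
    return [t / n_samples for t in totals]


if __name__ == "__main__":
    # Two similar target policies and one dissimilar one (made-up numbers).
    targets = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.1, 0.1, 0.8]]
    mean_rewards = [0.2, 0.5, 0.9]
    behavior = mixture_policy(targets)
    print(is_estimates(targets, behavior, mean_rewards, n_samples=20000))
```

Because the mixture places positive mass wherever any target policy does, each importance weight is bounded by the number of target policies, which keeps the estimators unbiased with finite variance. How to choose a better shared behavior policy when targets are similar is the question the article addresses.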