RLDM 2013
Complex Bandit Problems and Thompson Sampling
Abstract
We study stochastic multi-armed bandit settings with complex actions derived from the basic bandit arms, e. g. , subsets or partitions of basic arms. The decision maker is faced with selecting at each round a complex action instead of a basic arm. We allow the reward of the complex action to be some func- tion of the basic arms’ rewards, and so the feedback observed may not necessarily be the reward per-arm. For instance, when the complex actions are subsets of bandit arms, we may only observe the maximum reward over the chosen subset. Feedback from playing (complex) actions can thus be indicative of rewards from other actions, and leveraging this coupled feedback becomes important to the decision maker in or- der to learn efficiently. We propose applying Thompson Sampling – a Bayesian-inspired algorithm for the standard multi-armed bandit – for minimizing regret in complex bandit problems. We derive the first gener- al, frequentist regret bound for Thompson sampling in complex bandit settings, that holds without specific structural assumptions on the prior used by the algorithm. The regret bound exhibits the standard logarith- mic scaling with time but with a non-trivial multiplicative constant that encodes the coupled information structure of the complex bandit. As applications, we show improved regret bounds (compared to treating the complex actions as independent) for a class of complex, subset-selection bandit problems. Using particle filters for computing posterior distributions that often lack an explicit closed-form, we apply Thompson- sampling algorithms for subset selection and job-scheduling problems and present numerical results.
Authors
Keywords
No keywords are indexed for this paper.
Context
- Venue
- Multidisciplinary Conference on Reinforcement Learning and Decision Making
- Archive span
- 2013-2025
- Indexed papers
- 1004
- Paper id
- 742582260131714973