TMLR 2025

AEAP: A Reinforcement Learning Actor Ensemble Algorithm with Adaptive Pruning

Journal Article Articles Artificial Intelligence · Machine Learning

Abstract

Actor ensemble reinforcement learning methods have shown promising performance on dense-reward continuous control tasks. However, they exhibit three primary limitations: (1) diversity collapse when using a shared replay buffer, often necessitating carefully tuned regularization terms; (2) computational overhead from maintaining multiple actors; and (3) analytically intractable policy gradients when using stochastic policies in ensembles, requiring approximations that may compromise performance. To address this third limitation, we restrict the ensemble to deterministic policies and propose Actor Ensemble with Adaptive Pruning (AEAP), a multi-actor deterministic policy gradient algorithm that tackles the remaining limitations through a two-stage approach. First, to alleviate diversity collapse, AEAP employs dual-randomized actor selection that decorrelates exploration and learning by randomly choosing different actors for both environment interaction and policy update. This approach also removes reliance on explicit regularization. Second, when convergence to homogeneous policies still occurs over time, computational efficiency is further achieved through adaptive dual-criterion pruning, which progressively removes underperforming or redundant actors based on critic-estimated value and action-space similarity. Although AEAP introduces four additional hyperparameters compared to TD3 (a baseline single-actor deterministic policy gradient algorithm), we provide two domain-agnostic parameter configurations that perform robustly across environments without requiring tuning. AEAP achieves superior or competitive asymptotic performance compared to baselines across six dense-reward MuJoCo tasks. On sparse-reward Fetch benchmarks, AEAP outperforms deterministic policy gradient methods but falls short of SAC (a baseline stochastic policy gradient algorithm) on one of three tasks. 
When compared to fixed-size multi-actor baselines, AEAP reduces wall-clock time without sacrificing performance, establishing it as an efficient and reliable actor ensemble variant.
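
The two stages described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the exact selection or pruning rules, so the function names (`select_actors`, `prune_actors`, `mean_action_distance`), the independent uniform draws, the Euclidean action distance, and the parameters `value_margin` and `similarity_tol` are all assumptions introduced here for illustration.

```python
import math
import random


def select_actors(num_actors, rng):
    """Draw one actor index for environment interaction and one for the
    policy update. Independent uniform draws are an assumption; the
    abstract only says the two choices are randomized."""
    behaviour_idx = rng.randrange(num_actors)
    update_idx = rng.randrange(num_actors)
    return behaviour_idx, update_idx


def mean_action_distance(actions_a, actions_b):
    """Mean Euclidean distance between two actors' actions on the same
    batch of states (a hypothetical action-space similarity measure)."""
    total = sum(math.dist(a, b) for a, b in zip(actions_a, actions_b))
    return total / len(actions_a)


def prune_actors(values, actions, value_margin, similarity_tol):
    """Return the sorted indices of the actors that survive pruning.

    values[i]  : critic-estimated value of actor i (e.g. mean Q on a batch).
    actions[i] : list of action vectors produced by actor i on that batch.
    value_margin (>= 0) and similarity_tol are hypothetical thresholds.
    """
    best = max(values)
    # Visit actors from highest to lowest estimated value, so that when
    # two actors are redundant the better-valued one is retained.
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    kept = []
    for i in order:
        # Criterion 1 (assumed form): underperforming relative to the best.
        if best - values[i] > value_margin:
            continue
        # Criterion 2 (assumed form): near-duplicate of an already kept actor.
        if any(mean_action_distance(actions[i], actions[j]) < similarity_tol
               for j in kept):
            continue
        kept.append(i)
    return sorted(kept)
```

With four actors whose estimated values are `[1.0, 0.2, 0.95, 0.9]`, a margin of `0.5` would drop the second actor as underperforming, and a similarity tolerance of `0.1` would drop any actor whose actions nearly coincide with those of a higher-valued one. In this sketch the best-valued actor always survives, so the ensemble never becomes empty.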

Context

Venue: Transactions on Machine Learning Research
Archive span: 2022-2026
Indexed papers: 3849
Paper id: 50662926421238760