AAMAS 2025 Conference Paper
CPE: A New Paradigm for Policy Extraction in Offline Reinforcement Learning
- Zhaohui Yang
- Xiaoxuan Wang
- Linjing Li
Offline reinforcement learning (RL) aims to extract the optimal policy from static offline datasets but inevitably encounters the notorious distribution shift problem. To address this problem, most previous offline RL algorithms rely primarily on modifications at the policy evaluation stage. However, the performance gap between different policy extraction methods is significant even under the same value function. We therefore focus on the policy extraction stage and introduce a novel policy extraction method called Contrastive Policy Extraction (CPE), which samples action pairs at each state and leverages their relative values to improve the policy. By reformulating the optimal policy parameterization problem as a root-finding problem, CPE enhances policy extraction and surpasses prominent extraction methods in offline RL such as AWAC and TD3+BC. CPE is implemented within the iterative actor-critic framework and substantially outperforms current state-of-the-art (SOTA) offline RL algorithms on the D4RL benchmarks.
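To make the pairwise idea concrete, the following is a minimal PyTorch sketch of a contrastive policy-extraction objective, not the paper's actual implementation. It assumes a stochastic `policy` that maps states to a `torch.distributions` object and a frozen `critic` estimating Q(s, a); the function name `cpe_loss`, the temperature `beta`, and the logistic weighting of the relative value are all illustrative assumptions.

```python
import torch

def cpe_loss(policy, critic, states, beta=1.0):
    """Pairwise (contrastive) policy-extraction loss for a batch of states.

    Illustrative assumptions, not the paper's API: `policy(states)` returns a
    torch.distributions object whose log_prob yields one value per state
    (e.g. an Independent distribution), and `critic(states, actions)` returns
    a Q-value tensor of shape (batch,).
    """
    dist = policy(states)

    # Sample a pair of candidate actions at each state.
    a1, a2 = dist.sample(), dist.sample()

    # Relative value of the pair under the frozen critic: Q(s, a1) - Q(s, a2).
    with torch.no_grad():
        adv = critic(states, a1) - critic(states, a2)

    # Push the policy toward the higher-valued action of each pair. Weighting
    # by a logistic function of the relative value is one plausible choice;
    # the paper's exact objective may differ.
    weight = torch.sigmoid(beta * adv)
    logp1, logp2 = dist.log_prob(a1), dist.log_prob(a2)
    return -(weight * logp1 + (1.0 - weight) * logp2).mean()
```

In an iterative actor-critic loop, a loss of this form would replace the usual weighted behavior-cloning or deterministic policy-gradient actor update, while the critic is trained as usual on the offline dataset.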