
AAAI 2026

When Instinct Guides and Insight Grounds: Staged RL Training for LLM Agents

Conference Paper · AAAI Technical Track on Natural Language Processing VI · Artificial Intelligence

Abstract

Large Language Model (LLM) agents have demonstrated strong potential in complex, interactive decision-making tasks. However, when training LLM agents end-to-end with reinforcement learning (RL), efficiently optimizing agent policies in dynamic environments remains a significant challenge. Existing RL-based LLM agent paradigms commonly organize interactions in a cycle where reasoning is followed by action. In our work, we observe a phenomenon we call Exploration Contraction, where the explicit introduction of a reasoning stage reduces the diversity of actions—quantified by lower action entropy—which in turn limits exploration and leads to premature policy convergence. To address this limitation, we propose Act-before-Reasoning (ActRe), a two-stage RL training framework. In the first stage, we reverse the typical rollout order, prompting the agent to generate actions prior to reasoning, which encourages exploration driven by model intuition. In the second stage, we restore the standard reasoning-then-action order for training and evaluation, ensuring robust and interpretable decision-making. Experiments on the ALFWorld and WebShop benchmarks show that ActRe effectively mitigates exploration contraction, yielding consistently higher task success rates and improved training robustness compared to strong RL baselines. Our analysis underscores the importance of action entropy in the exploration-exploitation trade-off during LLM agent training and provides a practical approach to maintain the benefits of explicit reasoning while promoting sufficient exploration.
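The action-entropy signal behind the Exploration Contraction phenomenon can be illustrated with a short sketch. This is not the paper's implementation: the `action_entropy` helper and the two sampled-action lists are hypothetical, standing in for rollouts from a reason-then-act policy that has collapsed onto a few actions versus an act-first policy that still explores.

```python
import math
from collections import Counter

def action_entropy(actions):
    """Shannon entropy (in nats) of the empirical action distribution."""
    counts = Counter(actions)
    n = len(actions)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

# Hypothetical ALFWorld-style rollouts (illustrative strings only).
# A contracted reason-then-act policy repeats a few actions:
reason_then_act = ["go to desk"] * 8 + ["open drawer"] * 2
# An act-before-reasoning policy samples more diverse actions:
act_then_reason = ["go to desk", "open drawer", "take key", "go to shelf",
                   "open drawer", "examine desk", "go to desk", "take lamp",
                   "open cabinet", "go to shelf"]

print(action_entropy(reason_then_act))   # low entropy: exploration contraction
print(action_entropy(act_then_reason))   # higher entropy: broader exploration
```

Under this toy measure, the diverse act-first rollout scores markedly higher entropy, which is the quantity ActRe's first stage is designed to keep from collapsing before the second stage restores the reasoning-then-action order.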

Authors

Keywords

No keywords are indexed for this paper.

Context

Venue
AAAI Conference on Artificial Intelligence
Archive span
1980–2026
Indexed papers
28718
Paper id
644647684487329101