Arrow Research

Author name cluster

Songjun Tu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

7 papers
2 author rows

Possible papers (7)

NeurIPS 2025 Conference Paper

AlphaDecay: Module-wise Weight Decay for Heavy-Tailed Balancing in LLMs

  • Di He
  • Songjun Tu
  • Ajay Jaiswal
  • Li Shen
  • Ganzhao Yuan
  • Shiwei Liu
  • Lu Yin

Weight decay is a standard regularization technique for training large language models (LLMs). While it is common to assign a uniform decay rate to every layer, this approach overlooks the structural diversity of LLMs and the varying spectral properties across modules. In this paper, we introduce AlphaDecay, a simple yet effective method that adaptively assigns different weight decay strengths to each module of an LLM. Our approach is guided by Heavy-Tailed Self-Regularization (HT-SR) theory, which analyzes the empirical spectral density (ESD) of weight correlation matrices to quantify “heavy-tailedness.” Modules exhibiting more pronounced heavy-tailed ESDs, reflecting stronger feature learning, are assigned weaker decay, while modules with lighter-tailed spectra receive stronger decay. Our method leverages tailored weight decay assignments to balance the module-wise differences in spectral properties, leading to improved performance. Extensive pre-training experiments with model sizes from 60M to 1B demonstrate that AlphaDecay achieves better perplexity and generalization than conventional uniform decay and other adaptive decay baselines. The code is available at https://github.com/hed-ucas/AlphaDecay.
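For intuition only, here is a minimal sketch of the module-wise idea described above (the tail-exponent estimator and scaling rule are simplified assumptions, not the released AlphaDecay implementation): estimate a heavy-tail exponent for each weight matrix's spectrum and scale that module's decay inversely with how heavy-tailed it is.

```python
import torch

def esd_alpha(weight: torch.Tensor) -> float:
    """Crude stand-in for an HT-SR tail-exponent estimate: a Hill-type fit on the
    largest eigenvalues of the weight correlation matrix W^T W. Smaller alpha
    means a heavier-tailed ESD. Illustrative only."""
    eig = torch.linalg.svdvals(weight.detach().float()) ** 2   # eigenvalues of W^T W, descending
    eig = eig[eig > 1e-12]
    top = eig[: max(8, eig.numel() // 10)]                     # the heavy tail sits in the top eigenvalues
    logs = torch.log(top)
    return float(1.0 + top.numel() / torch.sum(logs - logs.min() + 1e-12))

def module_wise_decay(model: torch.nn.Module, base_decay: float = 0.1):
    """Build optimizer parameter groups whose weight decay scales with the module's
    tail exponent: heavier-tailed modules (smaller alpha) get weaker decay,
    lighter-tailed modules get stronger decay. 2-D weight matrices only, for brevity."""
    params = dict(model.named_parameters())
    alphas = {n: esd_alpha(p) for n, p in params.items() if p.dim() == 2}
    mean_alpha = sum(alphas.values()) / len(alphas)
    return [{"params": [params[n]], "weight_decay": base_decay * a / mean_alpha}
            for n, a in alphas.items()]

# e.g. torch.optim.AdamW(module_wise_decay(model), lr=3e-4); in practice the
# per-module decay would be re-estimated periodically during pre-training.
```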

AAAI 2025 Conference Paper

In-Dataset Trajectory Return Regularization for Offline Preference-based Reinforcement Learning

  • Songjun Tu
  • Jingbo Sun
  • Qichao Zhang
  • Yaocheng Zhang
  • Jia Liu
  • Ke Chen
  • Dongbin Zhao

Offline preference-based reinforcement learning (PbRL) typically operates in two phases: first, use human preferences to learn a reward model and annotate rewards for a reward-free offline dataset; second, learn a policy by optimizing the learned reward via offline RL. However, accurately modeling step-wise rewards from trajectory-level preference feedback presents inherent challenges. The reward bias introduced, particularly the overestimation of predicted rewards, leads to optimistic trajectory stitching, which undermines the pessimism mechanism critical to the offline RL phase. To address this challenge, we propose In-Dataset Trajectory Return Regularization (DTR) for offline PbRL, which leverages conditional sequence modeling to mitigate the risk of learning inaccurate trajectory stitching under reward bias. Specifically, DTR employs Decision Transformer and TD-Learning to strike a balance between maintaining fidelity to the behavior policy with high in-dataset trajectory returns and selecting optimal actions based on high reward labels. Additionally, we introduce an ensemble normalization technique that effectively integrates multiple reward models, balancing the trade-off between reward differentiation and accuracy. Empirical evaluations on various benchmarks demonstrate the superiority of DTR over other state-of-the-art baselines.
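The ensemble normalization step mentioned above can be pictured with a small sketch (the interface is an assumption for illustration, not the paper's released code): each reward model's predictions are standardized before averaging, so that no single model's scale dominates the aggregated labels.

```python
import torch

def ensemble_normalized_reward(reward_models, obs, act):
    """Hypothetical sketch of reward-ensemble normalization: standardize each
    model's predicted rewards within the batch, then average across the ensemble,
    trading reward differentiation against accuracy."""
    normalized = []
    for rm in reward_models:
        r = rm(obs, act)                                 # per-model predicted rewards, shape [batch]
        normalized.append((r - r.mean()) / (r.std() + 1e-8))
    return torch.stack(normalized, dim=0).mean(dim=0)    # aggregated reward label per transition
```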

NeurIPS 2025 Conference Paper

Learning When to Think: Shaping Adaptive Reasoning in R1-Style Models via Multi-Stage RL

  • Songjun Tu
  • Jiahao Lin
  • Qichao Zhang
  • Xiangyu Tian
  • Linjing Li
  • Xiangyuan Lan
  • Dongbin Zhao

Large reasoning models (LRMs) are proficient at generating explicit, step-by-step reasoning sequences before producing final answers. However, such detailed reasoning can introduce substantial computational overhead and latency, particularly for simple problems. To address this over-thinking problem, we explore how to equip LRMs with adaptive thinking capabilities, enabling them to dynamically decide whether or not to engage in explicit reasoning based on problem complexity. Building on R1-style distilled models, we observe that inserting a simple ellipsis ("...") into the prompt can stochastically trigger either a thinking or no-thinking mode, revealing a latent controllability in the reasoning behavior. Leveraging this property, we propose AutoThink, a multi-stage reinforcement learning (RL) framework that progressively optimizes reasoning policies via stage-wise reward shaping. AutoThink learns to invoke explicit reasoning only when necessary, while defaulting to succinct responses for simpler tasks. Experiments on five mainstream mathematical benchmarks demonstrate that AutoThink achieves favorable accuracy-efficiency trade-offs compared to recent prompting and RL-based pruning methods. It can be seamlessly integrated into any R1-style model, including both distilled and further fine-tuned variants. Notably, AutoThink improves relative accuracy by 6.4% while reducing token usage by 52% on DeepSeek-R1-Distill-Qwen-1.5B, establishing a scalable and adaptive reasoning paradigm for LRMs. Project Page: https://github.com/ScienceOne-AI/AutoThink.
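The stage-wise reward shaping can be illustrated with a toy sketch (the coefficients and stage schedule below are assumptions, not the paper's exact design): correctness is always rewarded, while later stages increasingly penalize invoking the explicit thinking mode, so the policy learns to think only when it pays off.

```python
def shaped_reward(answer_correct: bool, used_thinking: bool, stage: int) -> float:
    """Toy stage-wise shaping for adaptive reasoning. Early stages reward
    correctness regardless of mode; later stages subtract a growing penalty when
    the thinking mode is used, nudging the model toward concise answers on easy
    problems. All values are illustrative."""
    base = 1.0 if answer_correct else 0.0
    think_penalty = {1: 0.0, 2: 0.1, 3: 0.3}.get(stage, 0.3)   # penalty grows across stages
    return base - (think_penalty if used_thinking else 0.0)
```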

AAMAS 2025 Conference Paper

Offline Goal-Conditioned Reinforcement Learning with Elastic-Subgoal Diffused Policy Learning

  • Yaocheng Zhang
  • Yuanheng Zhu
  • Yuqian Fu
  • Songjun Tu
  • Dongbin Zhao

Goal-conditioned reinforcement learning (GCRL) aims to learn a policy that generalizes across different goal conditions. Compared to non-hierarchical methods, hierarchical GCRL based on subgoals can alleviate the problem of inaccurately estimating the value function for faraway goals in offline learning scenarios, thereby leading to more effective policy learning. Because of the complexity of the decision-making process, different states require subgoals from varying future time steps to minimize policy errors caused by noisy value functions, rather than a fixed future time step for selecting subgoals. Therefore, we propose a hierarchical reinforcement learning algorithm with elastic subgoal steps, called ESD (Elastic-Subgoal Diffused Policy Learning). Our method defines a novel high-level policy in which all reachable states surrounding the current state are considered as potential subgoals, and the optimal subgoal is selected among them. Moreover, we use diffusion models to represent the hierarchical policies, enhancing their ability to capture the multimodal data distribution introduced by the elastic subgoal steps and offline data. We evaluate the performance of ESD on multiple goal-conditioned benchmarks, and it demonstrates superior performance compared to previous baselines. Our method effectively reduces the impact of inaccurate value function estimates on policy accuracy, especially in complex tasks and with high-dimensional image observations. Code is available at https://github.com/zhyaoch/ESD.
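The "elastic" subgoal idea can be sketched in a few lines (assumed interface, not the released ESD code): rather than always picking the state a fixed k steps ahead, scan candidate future states within a horizon and choose the one the value function rates highest with respect to the final goal.

```python
def select_elastic_subgoal(trajectory_states, current_idx, value_fn, goal, max_horizon):
    """Illustrative elastic subgoal selection: candidates are the reachable states
    within `max_horizon` steps of the current state in the offline trajectory, and
    the subgoal is the candidate with the highest estimated value toward `goal`."""
    candidates = trajectory_states[current_idx + 1 : current_idx + 1 + max_horizon]
    return max(candidates, key=lambda s: value_fn(s, goal))
```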

AAMAS 2025 Conference Paper

Online Preference-based Reinforcement Learning with Self-augmented Feedback from Large Language Model

  • Songjun Tu
  • Jingbo Sun
  • Qichao Zhang
  • Xiangyuan Lan
  • Dongbin Zhao

Preference-based reinforcement learning (PbRL) provides a powerful paradigm to avoid meticulous reward engineering by learning rewards based on human preferences. However, real-time human feedback is hard to obtain in online tasks. Most work supposes there is a "scripted teacher" that uses a privileged, predefined reward to provide preference feedback. In this paper, we propose an RL Self-augmented Large Language Model Feedback (RL-SaLLM-F) technique that does not rely on privileged information for online PbRL. RL-SaLLM-F leverages the reflective and discriminative capabilities of LLMs to generate self-augmented trajectories and provide preference labels for reward learning. First, we identify a failure issue in LLM-based preference discrimination, specifically "query ambiguity", in online PbRL. Then the LLM is employed to provide preference labels and to generate self-augmented imagined trajectories that better achieve the task goal, thereby enhancing the quality and efficiency of feedback. Additionally, a double-check mechanism is introduced to mitigate randomness in the preference labels, improving the reliability of LLM feedback. Experiments across multiple tasks in the MetaWorld benchmark demonstrate the specific contributions of each proposed module in RL-SaLLM-F and show that self-augmented LLM feedback can effectively replace the impractical "scripted teacher" feedback. In summary, RL-SaLLM-F introduces a new direction of feedback acquisition in online PbRL that does not rely on any online privileged information, offering an efficient and lightweight solution with LLM-driven feedback. Code Page: https://github.com/TU2021/RL-SaLLM-F.
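The double-check mechanism described above can be pictured as follows (assumed labeling interface, not the released RL-SaLLM-F code): the LLM is queried twice with the trajectory order swapped, and a preference label is kept only when the two answers agree.

```python
def double_checked_preference(llm_label, task_desc, traj_a, traj_b):
    """Illustrative double-check for LLM preference labels: `llm_label` is assumed
    to return 0 if the first trajectory is preferred, 1 if the second, or None if
    the answer cannot be parsed. Inconsistent pairs are discarded."""
    first = llm_label(task_desc, traj_a, traj_b)
    second = llm_label(task_desc, traj_b, traj_a)   # same query, order swapped
    if first is not None and second is not None and first == 1 - second:
        return first                                # consistent label, keep it
    return None                                     # ambiguous or random answer, skip this pair
```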

AAMAS 2025 Conference Paper

Salience-Invariant Consistent Policy Learning for Generalization in Visual Reinforcement Learning

  • Jingbo Sun
  • Songjun Tu
  • Qichao Zhang
  • Ke Chen
  • Dongbin Zhao

Generalizing policies to unseen scenarios remains a critical challenge in visual reinforcement learning, where agents often overfit to the specific visual observations of the training environment. In unseen environments, distracting pixels may lead agents to extract representations containing task-irrelevant information. As a result, agents may deviate from the optimal behaviors learned during training, thereby hindering visual generalization. To address this issue, we propose the Salience-Invariant Consistent Policy Learning (SCPL) algorithm, an efficient framework for zero-shot generalization. Our approach introduces a novel value consistency module alongside a dynamics module to effectively capture task-relevant representations. The value consistency module, guided by saliency, ensures the agent focuses on task-relevant pixels in both original and perturbed observations, while the dynamics module uses augmented data to help the encoder capture dynamic- and reward-relevant representations. Additionally, our theoretical analysis highlights the importance of policy consistency for generalization. To strengthen this, we introduce a policy consistency module with a KL divergence constraint to maintain consistent policies across original and perturbed observations. Extensive experiments on the DMC-GB, Robotic Manipulation, and CARLA benchmarks demonstrate that SCPL significantly outperforms state-of-the-art methods in terms of generalization. Notably, SCPL achieves average performance improvements of 14%, 39%, and 69% in the challenging DMC video hard setting, the Robotic hard setting, and the CARLA benchmark, respectively. Project Page: https://sites.google.com/view/scpl-rl.
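The policy consistency module with its KL constraint can be sketched roughly as below (a discrete-action interface is assumed for brevity; SCPL itself targets continuous-control benchmarks): the action distribution on a perturbed observation is pulled toward the distribution on the original observation.

```python
import torch
import torch.nn.functional as F

def policy_consistency_loss(policy, obs, augment):
    """Minimal KL-based policy consistency term: treat the policy's output on the
    clean observation as the target and penalize divergence of the policy's output
    on the augmented observation. `policy` is assumed to return action logits."""
    with torch.no_grad():
        target = F.softmax(policy(obs), dim=-1)              # distribution on the original view
    log_pred = F.log_softmax(policy(augment(obs)), dim=-1)   # distribution on the perturbed view
    return F.kl_div(log_pred, target, reduction="batchmean")
```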

ICLR 2025 Conference Paper

Unsupervised Zero-Shot Reinforcement Learning via Dual-Value Forward-Backward Representation

  • Jingbo Sun
  • Songjun Tu
  • Qichao Zhang
  • Haoran Li 0010
  • Xin Liu 0039
  • Yaran Chen
  • Ke Chen
  • Dongbin Zhao

Online unsupervised reinforcement learning (URL) can discover diverse skills via reward-free pre-training and exhibits impressive downstream task adaptation abilities through further fine-tuning. However, online URL methods face challenges in achieving zero-shot generalization, i.e., directly applying pre-trained policies to downstream tasks without additional planning or learning. In this paper, we propose a novel Dual-Value Forward-Backward representation (DVFB) framework with a contrastive entropy intrinsic reward to achieve both zero-shot generalization and fine-tuning adaptation in online URL. On the one hand, we demonstrate that poor exploration in forward-backward representations can lead to limited data diversity in online URL, impairing successor measures, and ultimately constraining generalization ability. To address this issue, the DVFB framework learns successor measures through a skill value function while promoting data diversity through an exploration value function, thus enabling zero-shot generalization. On the other hand, and somewhat surprisingly, by employing a straightforward dual-value fine-tuning scheme combined with a reward mapping technique, the pre-trained policy further enhances its performance through fine-tuning on downstream tasks, building on its zero-shot performance. Through extensive multi-task generalization experiments, DVFB demonstrates both superior zero-shot generalization (outperforming on all 12 tasks) and fine-tuning adaptation (leading on 10 out of 12 tasks) abilities, surpassing state-of-the-art URL methods.
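One way to picture the dual-value idea is a combined actor objective over two critics (a hypothetical interface, not the paper's implementation): a skill value that underlies the successor-measure learning and an exploration value that rewards data diversity during reward-free pre-training.

```python
import torch

def dual_value_actor_loss(skill_q, explore_q, obs, skill_z, actions, beta=0.5):
    """Illustrative dual-value actor objective: maximize a weighted sum of the
    skill (forward-backward / successor-measure) value and an exploration value
    driven by an intrinsic reward. `beta` and the critic signatures are assumed."""
    q_skill = skill_q(obs, skill_z, actions)    # value of the action for the sampled skill
    q_explore = explore_q(obs, actions)         # value under the contrastive-entropy intrinsic reward
    return -(q_skill + beta * q_explore).mean()
```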