Arrow Research search

Author name cluster

Shuo Wu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

2 papers
1 author row

Possible papers

2

NeurIPS 2025 · Conference Paper

Hierarchical Balance Packing: Towards Efficient Supervised Fine-tuning for Long-Context LLM

  • Yongqiang Yao
  • Jingru Tan
  • Kaihuan Liang
  • Feizhao Zhang
  • Jiahao Hu
  • Shuo Wu
  • Yazhe Niu
  • Ruihao Gong

Training Long-Context Large Language Models (LLMs) is challenging, as hybrid training with long-context and short-context data often leads to workload imbalances. Existing works mainly use data packing to alleviate this issue, but fail to consider imbalanced attention computation and wasted communication overhead. This paper proposes Hierarchical Balance Packing (HBP), which designs a novel batch-construction method and training recipe to address those inefficiencies. In particular, HBP constructs multi-level data packing groups, each optimized with a distinct packing length. It assigns training samples to their optimal groups and configures each group with the most effective settings, including sequential parallelism degree and gradient checkpointing configuration. To effectively utilize multi-level groups of data, we design a dynamic training pipeline specifically tailored to HBP, including curriculum learning, adaptive sequential parallelism, and stable loss. Our extensive experiments demonstrate that our method significantly reduces training time over multiple datasets and open-source models while maintaining strong performance. For the largest DeepSeek-V2 (236B) MoE model, our method speeds up training by 2.4$\times$ with competitive performance. Codes will be released at https://github.com/ModelTC/HBP.

JAIR 2025 · Journal Article

Robust Reward Design for Markov Decision Processes

  • Shuo Wu
  • Haoxiang Ma
  • Jie Fu
  • Shuo Han

The problem of reward design examines the interaction between a leader and a follower, where the leader aims to shape the follower’s behavior to maximize the leader’s payoff by modifying the follower’s reward function. Current approaches to reward design rely on an accurate model of how the follower responds to reward modifications, which can be sensitive to modeling inaccuracies. To address this issue of sensitivity, we present a solution that offers robustness against uncertainties in modeling the follower, including 1) how the follower breaks ties in the presence of nonunique best responses, 2) inexact knowledge of how the follower perceives reward modifications, and 3) bounded rationality of the follower. Our robust solution is guaranteed to exist under mild conditions and can be obtained numerically by solving a mixed-integer linear program. Numerical experiments on multiple test cases demonstrate that our solution improves robustness compared to the standard approach without incurring significant additional computing costs.