Zhouhao Sun Papers

NeurIPS Conference 2025 Conference Paper

UFO-RL: Uncertainty-Focused Optimization for Efficient Reinforcement Learning Data Selection

Yang Zhao
Kai Xiong
Xiao Ding
Li Du
Yangou Ouyang
Zhouhao Sun
Jiannan Guan
Wenbin Zhang

A primary impediment to scaling reinforcement learning (RL) for large language model (LLM) training is the substantial computational cost, predominantly arising from the necessity of multi-sampling for policy optimization and evaluation. This underscores the critical yet challenging nature of efficient training data selection. Drawing inspiration from the Zone of Proximal Development (ZPD) theory, which posits that learners acquire knowledge more effectively from tasks of intermediate difficulty, we hypothesize that LLMs exhibit optimal learning from data they have not yet mastered but demonstrate the potential to comprehend. Conventional methodologies for assessing data difficulty or informativeness typically rely on computationally intensive multi-sampling or iterative procedures. To address this limitation, we introduce UFO-RL (**U**ncertainty-**F**ocused **O**ptimization for **R**einforcement **L**earning), a novel framework that employs a computationally efficient single-pass uncertainty estimation technique to identify informative training instances. This method, requiring only a single forward pass and obviating the need for iterative next-token computation, achieves a significant acceleration (up to 185$\times$) in data evaluation compared to multi-sampling approaches. UFO-RL leverages this efficient metric to select data within the model's estimated ZPD for training. Extensive experimentation across diverse LLMs and mathematical benchmarks demonstrates that training with a mere 10\% of the data, carefully selected by UFO-RL, yields performance comparable to or even surpassing that of full-data training. Furthermore, this targeted data selection results in up to a 16$\times$ reduction in overall training time, concurrently enhancing training stability and improving generalization capabilities. Thus, UFO-RL presents a practical and highly efficient strategy for scaling RL fine-tuning of LLMs by focusing learning efforts on the most informative and valuable data, thereby mitigating the computational bottlenecks associated with traditional RL training.

PDF Details

AAAI Conference 2023 Conference Paper

Self-Supervised Logic Induction for Explainable Fuzzy Temporal Commonsense Reasoning

Bibo Cai
Xiao Ding
Zhouhao Sun
Bing Qin
Ting Liu
Baojun Wang
Lifeng Shang

Understanding temporal commonsense concepts, such as times of occurrence and durations is crucial for event-centric language understanding. Reasoning about such temporal concepts in a complex context requires reasoning over both the stated context and the world knowledge that underlines it. A recent study shows massive pre-trained LM still struggle with such temporal reasoning under complex contexts (e.g., dialog) because they only implicitly encode the relevant contexts and fail to explicitly uncover the underlying logical compositions for complex inference, thus may not be robust enough. In this work, we propose to augment LMs with the temporal logic induction ability, which frames the temporal reasoning by defining three modular components: temporal dependency inducer and temporal concept defuzzifier and logic validator. The former two components disentangle the explicit/implicit dependency between temporal concepts across context (before, after,...) and the specific meaning of fuzzy temporal concepts, respectively, while the validator combines the intermediate reasoning clues for robust contextual reasoning about the temporal concepts. Extensive experimental results on TIMEDIAL, a challenging dataset for temporal reasoning over dialog, show that our method, Logic Induction Enhanced Contextualized TEmporal Reasoning (LECTER), can yield great improvements over the traditional language model for temporal reasoning.

PDF Details DOI

Possible papers

UFO-RL: Uncertainty-Focused Optimization for Efficient Reinforcement Learning Data Selection

Self-Supervised Logic Induction for Explainable Fuzzy Temporal Commonsense Reasoning