NeurIPS 2025

rStar-Coder: Scaling Competitive Code Reasoning with a Large-Scale Verified Dataset

Conference Paper Main Conference Track Artificial Intelligence · Machine Learning

Abstract

Advancing code reasoning in large language models (LLMs) is fundamentally limited by the scarcity of high-difficulty datasets, especially those with verifiable input-output test cases necessary for rigorous solution validation at scale. We introduce rStar-Coder, which significantly improves LLM code reasoning capabilities by constructing a large-scale, verified dataset of 418K competition-level code problems, 580K long-reasoning solutions along with rich test cases of varying difficulty. This is achieved through three core contributions: (1) we curate competitive programming code problems and solutions to synthesize new, solvable problems; (2) we introduce a reliable input-output test case synthesis pipeline that decouples the generation into a three-step input generation method and a mutual verification mechanism for effective output labeling; (3) we augment problems with high-quality, test-case-verified long-reasoning solutions. Extensive experiments on Qwen models (1.5B-14B) across various code reasoning benchmarks demonstrate the superiority of rStar-Coder dataset, achieving leading performance comparable to frontier reasoning LLMs with significantly smaller model sizes. On LiveCodeBench, rStar-Coder improves Qwen2.5-7B from 17.4% to an impressive 57.3%, and Qwen2.5-14B from 23.3% to 62.5%, surpassing o3-mini (low) by 3.1%. On the more challenging USA Computing Olympiad, our 7B model achieves an average pass@1 accuracy of 16.15%, outperforming the frontier-level QWQ-32B. rStar-Coder dataset is publicly available at https://huggingface.co/datasets/microsoft/rStar-Coder.
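The abstract names a mutual verification mechanism for output labeling but does not describe its details. One plausible reading, offered only as a sketch, is that several independently sampled candidate solutions are run on the same generated inputs, and their outputs are accepted as labels only when enough of them agree on every input. The function name `label_outputs`, the `min_agreement` threshold, and the string-in/string-out solution interface below are all hypothetical and are not taken from the paper.

```python
from collections import Counter
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical interface: a candidate solution maps raw stdin text to stdout text.
Solution = Callable[[str], str]


def label_outputs(
    solutions: Sequence[Solution],
    test_inputs: Sequence[str],
    min_agreement: float = 0.6,  # assumed threshold, not from the paper
) -> Optional[Tuple[List[str], List[int]]]:
    """Label test inputs by agreement among candidate solutions.

    Returns (expected_outputs, indices_of_agreeing_solutions), or None when
    no group of solutions reaches the agreement threshold.
    """
    if not solutions:
        return None

    # Run every candidate on every input; a crash marks that output as None.
    signatures = []
    for solve in solutions:
        outputs = []
        for text in test_inputs:
            try:
                outputs.append(solve(text).strip())
            except Exception:
                outputs.append(None)
        signatures.append(tuple(outputs))

    # Only candidates that ran cleanly on all inputs may vote.
    valid = [sig for sig in signatures if None not in sig]
    if not valid:
        return None

    # The most common full output vector is the candidate labeling.
    best, count = Counter(valid).most_common(1)[0]
    if count / len(solutions) < min_agreement:
        return None

    agreeing = [i for i, sig in enumerate(signatures) if sig == best]
    return list(best), agreeing


if __name__ == "__main__":
    # Toy problem: print the sum of the integers on one line.
    good_a = lambda s: str(sum(int(t) for t in s.split()))
    good_b = lambda s: str(sum(map(int, s.split())))
    buggy = lambda s: str(max(int(t) for t in s.split()))

    print(label_outputs([good_a, good_b, buggy], ["1 2", "3 4 5"]))
```

In this toy run, two of the three candidates agree on every input, so their shared outputs become the labels and the disagreeing candidate is excluded. Requiring agreement on the full output vector, rather than per input, means a solution and the labels it helped produce are accepted or rejected together.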

Context

Venue
Annual Conference on Neural Information Processing Systems
Archive span
1987-2025
Indexed papers
30776
Paper id
895337705029572612