Arrow Research search
Back to NeurIPS

NeurIPS 2021

Measuring Coding Challenge Competence With APPS

Conference Paper Datasets and Benchmarks Track (round2) Artificial Intelligence ยท Machine Learning

Abstract

While programming is one of the most broadly applicable skills in modern society, it is unclear how well state-of-the-art machine learning models can write code. Despite its importance, there has been surprisingly little work on evaluating code generation, and it can be difficult to assess code generation performance in an accurate and rigorous manner. To meet this challenge, we introduce APPS, a benchmark for code generation. Unlike prior work in more restricted settings, our benchmark measures the ability of models to take an arbitrary natural language specification and generate satisfactory Python code. Similar to how companies assess candidate software developers, we evaluate models by checking their generated code on test cases. Our benchmark includes 10, 000 problems, which range from having simple one-line solutions to being substantial algorithmic challenges. We fine-tune large language models on both GitHub and our training set, and we find that the prevalence of syntax errors is decreasing exponentially as models improve. Recent models such as GPT-Neo can pass approximately 20% of the test cases of introductory problems, so we find that machine learning models are now beginning to learn how to code. As the social significance of automatic code generation increases over the coming years, our benchmark can provide an objective measure for tracking advancements.

Authors

Keywords

No keywords are indexed for this paper.

Context

Venue
Annual Conference on Neural Information Processing Systems
Archive span
1987-2025
Indexed papers
30776
Paper id
1047236890824731052