MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering

Jun Shern Chan; Neil Chowdhury; Oliver Jaffe; James Aung; Dane Sherburn; Evan Mays; Giulio Starace; Kevin Liu; Leon Maksin; Tejal Patwardhan; Aleksander Madry; Lilian Weng

ICLR 2025

Conference Paper · Accept (Oral) · Artificial Intelligence · Machine Learning

Abstract

We introduce MLE-bench, a benchmark for measuring how well AI agents perform at machine learning engineering. To this end, we curate 75 ML engineering-related competitions from Kaggle, creating a diverse set of challenging tasks that test real-world ML engineering skills such as training models, preparing datasets, and running experiments. We establish human baselines for each competition using Kaggle's publicly available leaderboards. We use open-source agent scaffolds to evaluate several frontier language models on our benchmark, finding that the best-performing setup, OpenAI's o1-preview with AIDE scaffolding, achieves at least the level of a Kaggle bronze medal in 16.9% of competitions. In addition to our main results, we investigate various forms of resource scaling for AI agents and the impact of contamination from pre-training. We open-source our benchmark code (https://github.com/openai/mle-bench) to facilitate future research in understanding the ML engineering capabilities of AI agents.

Keywords

benchmark
evals
evaluations
dataset
tasks
data science
engineering
agents
language agents
scaffold
coding
swe
mle

Context

Venue: International Conference on Learning Representations
Archive span: 2013-2025
Indexed papers: 10294
Paper id: 229869582058607012