ICLR 2025

MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering

Conference Paper · Accept (Oral) · Artificial Intelligence · Machine Learning

Abstract

We introduce MLE-bench, a benchmark for measuring how well AI agents perform at machine learning engineering. To this end, we curate 75 ML engineering-related competitions from Kaggle, creating a diverse set of challenging tasks that test real-world ML engineering skills such as training models, preparing datasets, and running experiments. We establish human baselines for each competition using Kaggle's publicly available leaderboards. We use open-source agent scaffolds to evaluate several frontier language models on our benchmark, finding that the best-performing setup (OpenAI's o1-preview with AIDE scaffolding) achieves at least the level of a Kaggle bronze medal in 16.9% of competitions. In addition to our main results, we investigate various forms of resource-scaling for AI agents and the impact of contamination from pre-training. We open-source our benchmark code (https://github.com/openai/mle-bench) to facilitate future research in understanding the ML engineering capabilities of AI agents.
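
The headline metric above, the share of competitions in which an agent reaches at least bronze-medal level, can be illustrated with a minimal sketch. This is not the benchmark's actual grading code: the `Result` class, the `earns_medal` and `medal_rate` helpers, the competition names, and all scores and thresholds below are hypothetical, and the sketch assumes each competition's bronze threshold has already been derived from its leaderboard.

```python
from dataclasses import dataclass


@dataclass
class Result:
    """One agent submission on one competition (all fields hypothetical)."""
    competition: str
    score: float              # agent's score on the competition metric
    bronze_threshold: float   # leaderboard score needed for a bronze medal
    higher_is_better: bool = True


def earns_medal(r: Result) -> bool:
    """True if the score is at least as good as the bronze threshold."""
    if r.higher_is_better:
        return r.score >= r.bronze_threshold
    return r.score <= r.bronze_threshold


def medal_rate(results: list[Result]) -> float:
    """Fraction of competitions where the agent reaches bronze or better."""
    if not results:
        raise ValueError("no results to aggregate")
    return sum(earns_medal(r) for r in results) / len(results)


# Made-up data for illustration only; not taken from the paper.
results = [
    Result("comp-a", 0.91, 0.90),
    Result("comp-b", 0.40, 0.55),
    Result("comp-c", 0.12, 0.15, higher_is_better=False),  # e.g. an error metric
    Result("comp-d", 0.70, 0.80),
]
print(f"{medal_rate(results):.1%}")
```

A non-integer figure such as 16.9% of 75 competitions would arise when a rate like this is averaged over several runs rather than computed from a single run.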

Authors

Keywords

  • benchmark
  • evals
  • evaluations
  • dataset
  • tasks
  • data science
  • engineering
  • agents
  • language agents
  • scaffold
  • coding
  • swe
  • mle

Context

Venue: International Conference on Learning Representations
Archive span: 2013-2025
Indexed papers: 10294
Paper id: 229869582058607012