Author name cluster

Evan Mays

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

2 papers

1 author row

ICLR Conference 2025 Conference Paper

MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering

Jun Shern Chan
Neil Chowdhury
Oliver Jaffe
James Aung
Dane Sherburn
Evan Mays
Giulio Starace
Kevin Liu

We introduce MLE-bench, a benchmark for measuring how well AI agents perform at machine learning engineering. To this end, we curate 75 ML engineering-related competitions from Kaggle, creating a diverse set of challenging tasks that test real-world ML engineering skills such as training models, preparing datasets, and running experiments. We establish human baselines for each competition using Kaggle's publicly available leaderboards. We use open-source agent scaffolds to evaluate several frontier language models on our benchmark, finding that the best-performing setup, OpenAI's o1-preview with AIDE scaffolding, achieves at least the level of a Kaggle bronze medal in 16.9% of competitions. In addition to our main results, we investigate various forms of resource-scaling for AI agents and the impact of contamination from pre-training. We open-source our benchmark code (https://github.com/openai/mle-bench) to facilitate future research in understanding the ML engineering capabilities of AI agents.


ICML Conference 2025 Conference Paper

PaperBench: Evaluating AI's Ability to Replicate AI Research

Giulio Starace
Oliver Jaffe
Dane Sherburn
James Aung
Jun Shern Chan
Leon Maksin
Rachel Dias
Evan Mays

We introduce PaperBench, a benchmark evaluating the ability of AI agents to replicate state-of-the-art AI research. Agents must replicate 20 ICML 2024 Spotlight and Oral papers from scratch, including understanding paper contributions, developing a codebase, and successfully executing experiments. For objective evaluation, we develop rubrics that hierarchically decompose each replication task into smaller sub-tasks with clear grading criteria. In total, PaperBench contains 8,316 individually gradable tasks. Rubrics are co-developed with the author(s) of each ICML paper for accuracy and realism. To enable scalable evaluation, we also develop an LLM-based judge to automatically grade replication attempts against rubrics, and assess our judge’s performance by creating a separate benchmark for judges. We evaluate several frontier models on PaperBench, finding that the best-performing tested agent, Claude 3.5 Sonnet (New) with open-source scaffolding, achieves an average replication score of 21.0%. Finally, we recruit top ML PhDs to attempt a subset of PaperBench, finding that models do not yet outperform the human baseline. We open-source our code (https://github.com/openai/preparedness) to facilitate future research in understanding the AI engineering capabilities of AI agents.
