Author name cluster

Seraphina Nix

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

3 papers
2 author rows

Possible papers

NeurIPS Conference 2025 Conference Paper

Measuring AI Ability to Complete Long Software Tasks

  • Thomas Kwa
  • Ben West
  • Joel Becker
  • Amy Deng
  • Katharyn Garcia
  • Max Hasin
  • Sami Jawhar
  • Megan Kinniment

Despite rapid progress on AI benchmarks, the real-world meaning of benchmark performance remains unclear. To quantify the capabilities of AI systems in terms of human capabilities, we propose a new metric: 50%-task-completion time horizon. This is the time humans typically take to complete tasks that AI models can complete with 50% success rate. We first timed humans with relevant domain expertise on a combination of RE-Bench, HCAST, and 66 novel shorter tasks. On these tasks, current frontier AI models such as o3 have a 50% time horizon of around 110 minutes. Furthermore, frontier AI time horizon has been doubling approximately every seven months since 2019, though the trend may have accelerated since 2024. The increase in AI models’ time horizons seems to be primarily driven by greater reliability and ability to adapt to mistakes, combined with better logical reasoning and tool use capabilities. We discuss the limitations of our results—including their degree of external validity—and the implications of increased autonomy for dangerous capabilities. If these results generalize to real-world software tasks, extrapolation of this trend predicts that within 5 years, AI systems will be capable of automating many software tasks that currently take humans a month.
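The extrapolation at the end of this abstract can be checked with simple arithmetic. The following is a minimal sketch, assuming a clean exponential trend that starts from the 110-minute horizon and uses the seven-month doubling time quoted above. The function name and the definition of a working month as roughly 167 hours are illustrative assumptions, not taken from the paper.

```python
import math


def months_until_horizon(current_minutes, target_minutes, doubling_months):
    """Months for an exponentially growing time horizon to reach a target.

    Assumes the horizon doubles every `doubling_months` months with no
    change in trend, which is a simplification.
    """
    doublings = math.log2(target_minutes / current_minutes)
    return doublings * doubling_months


# Figures quoted in the abstract.
CURRENT_HORIZON_MIN = 110
DOUBLING_MONTHS = 7

# Assumption (not from the paper): one working month is about 167 hours,
# i.e. 40 hours per week over roughly 50 weeks, divided by 12.
WORK_MONTH_MIN = 167 * 60

months = months_until_horizon(CURRENT_HORIZON_MIN, WORK_MONTH_MIN, DOUBLING_MONTHS)
print(f"{months:.1f} months, about {months / 12:.1f} years")
```

Under these assumptions the month-long horizon is reached in fewer than 60 months, which is consistent with the "within 5 years" statement. A different definition of a working month or a shift in the doubling time would move the result.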

ICML Conference 2025 Conference Paper

RE-Bench: Evaluating Frontier AI R&D Capabilities of Language Model Agents against Human Experts

  • Hjalmar Wijk
  • Tao Roa Lin
  • Joel Becker
  • Sami Jawhar
  • Neev Parikh
  • Thomas Broadley
  • Lawrence Chan
  • Michael Chen

Frontier AI safety policies highlight automation of AI research and development (R&D) by AI agents as an important capability to anticipate. However, there exist few evaluations for AI R&D capabilities, and none that are highly realistic and have a direct comparison to human performance. We introduce RE-Bench (Research Engineering Benchmark, V1), which consists of 7 challenging, open-ended ML research engineering environments and data from 71 8-hour attempts by 61 distinct human experts. We confirm that our experts make progress in the environments given 8 hours, with 82% of expert attempts achieving a non-zero score and 24% matching or exceeding our strong reference solutions. We compare humans to several public frontier models through best-of-k with varying time budgets and agent designs, and find that the best AI agents achieve a score 4× higher than human experts when both are given a total time budget of 2 hours per environment. However, humans currently display better returns to increasing time budgets, narrowly exceeding the top AI agent scores given an 8-hour budget, and achieving 2× the score of the top AI agent when both are given 32 total hours (across different attempts).
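The best-of-k comparison mentioned in this abstract can be illustrated with a small calculation. This is a sketch under stated assumptions, not the paper's evaluation code: it computes the expected maximum of k scores drawn with replacement from a set of observed run scores. The function name and the example scores are hypothetical.

```python
def expected_best_of_k(scores, k):
    """Expected maximum of k draws (with replacement) from observed scores.

    Uses the empirical distribution: with scores sorted ascending, the
    probability that the maximum is at most the i-th score is (i / n) ** k.
    """
    if not scores or k < 1:
        raise ValueError("need at least one score and k >= 1")
    xs = sorted(scores)
    n = len(xs)
    return sum(
        x * ((i / n) ** k - ((i - 1) / n) ** k)
        for i, x in enumerate(xs, start=1)
    )


# Hypothetical normalized scores from independent short runs.
run_scores = [0.0, 0.1, 0.1, 0.4, 0.9]

for k in (1, 2, 4, 8):
    print(k, round(expected_best_of_k(run_scores, k), 3))
```

With a fixed total time budget, a larger k means shorter individual attempts, so the trade-off between the number of attempts and the length of each attempt is what such a comparison explores.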

NeurIPS Conference 2022 Conference Paper

Adversarial training for high-stakes reliability

  • Daniel Ziegler
  • Seraphina Nix
  • Lawrence Chan
  • Tim Bauman
  • Peter Schmidt-Nielsen
  • Tao Lin
  • Adam Scherlis
  • Noa Nabeshima

In the future, powerful AI systems may be deployed in high-stakes settings, where a single failure could be catastrophic. One technique for improving AI safety in high-stakes settings is adversarial training, which uses an adversary to generate examples to train on in order to achieve better worst-case performance. In this work, we used a safe language generation task ("avoid injuries") as a testbed for achieving high reliability through adversarial training. We created a series of adversarial training techniques, including a tool that assists human adversaries, to find and eliminate failures in a classifier that filters text completions suggested by a generator. In our task, we determined that we can set very conservative classifier thresholds without significantly impacting the quality of the filtered outputs. We found that adversarial training significantly increased robustness to the adversarial attacks that we trained on, tripling the time to find adversarial examples without tools and doubling the time with our tool (from 13 to 26 minutes), without affecting in-distribution performance. We hope to see further work in the high-stakes reliability setting, including more powerful tools for enhancing human adversaries and better ways to measure high levels of reliability, until we can confidently rule out the possibility of catastrophic deployment-time failures of powerful models.
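The setup described in this abstract, a classifier that filters completions suggested by a generator, can be sketched as a simple threshold filter. This is an illustration under stated assumptions, not the authors' implementation: the function names, the keyword-based toy classifier, and the threshold value are all hypothetical.

```python
from typing import Callable, Iterable, Optional


def first_accepted(
    candidates: Iterable[str],
    risk_score: Callable[[str], float],
    threshold: float,
) -> Optional[str]:
    """Return the first candidate whose risk score is below the threshold.

    A lower threshold is more conservative: more candidates are rejected.
    Returns None if every candidate is rejected.
    """
    for text in candidates:
        if risk_score(text) < threshold:
            return text
    return None


# Hypothetical stand-in for a trained classifier: flags a few keywords.
def toy_risk_score(text: str) -> float:
    flagged = ("injured", "wounded", "bleeding")
    return 1.0 if any(word in text.lower() for word in flagged) else 0.0


completions = [
    "He fell and was badly injured.",
    "She sat down and opened the letter.",
]
accepted = first_accepted(completions, toy_risk_score, threshold=0.01)
print(accepted)
```

A real classifier would return a continuous score, and the choice of threshold would trade off how many acceptable completions are rejected against how many unsafe ones slip through.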