Founder & CEO at
METR, building evaluations so we know if we're getting close to very risky AI. Formerly at DeepMind and OpenAI.
Some research highlights:
- Measuring AI ability to complete long tasks - We propose measuring AI performance in terms of the length of tasks AI agents can complete. We show that this metric has increased exponentially and consistently over the past six years, with a doubling time of around seven months. Extrapolating this trend predicts that, in under a decade, AI agents will be able to independently complete a large fraction of software tasks that currently take humans days or weeks.
- GPT-5 autonomy evaluation report - We evaluate whether GPT-5 poses significant catastrophic risks via AI self-improvement, rogue replication, or sabotage of AI labs. We conclude that this seems unlikely. However, capability trends continue to advance rapidly, and models display increasing evaluation awareness.
- Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity - We conduct a randomized controlled trial (RCT) to understand how early-2025 AI tools affect the productivity of experienced open-source developers working on their own repositories. Surprisingly, we find that when developers use AI tools, they take 19% longer than without: AI makes them slower. We view this result as a snapshot of early-2025 AI capabilities in one relevant setting; as these systems continue to evolve rapidly, we plan to keep using this methodology to help estimate AI acceleration from AI R&D automation.
- Resources for Autonomy Evaluations - a task suite, an evaluation protocol, and estimates of the "elicitation gap"
- Evaluating LLM Agents on Realistic Autonomous Tasks
- Evaluating LLMs trained on code (alignment section)
- Obfuscated arguments problem - a problem with recursive-decomposition-based alignment approaches
- "Imitative generalisation" - an explainer for Paul Christiano's "Learning the Prior"
- Risks from AI persuasion - thoughts on the likelihood and consequences of superhuman persuasion arriving before AGI
- Reflection mechanisms as an alignment target - work by my AI Safety Camp mentees surveying Mechanical Turk workers on their attitudes towards different reflection mechanisms
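The extrapolation in the first highlight above can be sketched numerically. The seven-month doubling time is from that work; the one-hour baseline and the "week-long task = 40 hours" threshold below are illustrative assumptions, not figures from the paper.

```python
import math

def months_to_reach(target_hours: float, current_hours: float = 1.0,
                    doubling_months: float = 7.0) -> float:
    """Months until the trend reaches tasks of `target_hours`, assuming
    exponential growth in task length with the given doubling time.

    The baseline of 1 hour is an illustrative assumption."""
    return doubling_months * math.log2(target_hours / current_hours)

# Under these assumed numbers, week-long (~40-hour) tasks arrive in
# roughly three years, comfortably inside the decade in the summary:
print(round(months_to_reach(40.0), 1))   # ~37.2 months
```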
I sometimes post alignment-related thinking here.
Contact me at: beth dot m dot surname at gmail.com
Follow me on Twitter/X.
If you have any feedback for me, I'd love to hear it. You can submit it anonymously (or pseudonymously) here.