Anthropic’s own data says AI is already building AI

OraCore Editors

[RSCH] June 12, 2026 · 5 min read

Anthropic’s data shows AI is already accelerating AI development, and that should alarm every serious builder.

Anthropic says AI is already speeding up the work of building better AI.

Anthropic is right to frame recursive self-improvement as a present engineering trend, not a distant sci-fi scenario. Its own numbers show a company moving from human-first development to AI-assisted development at a pace that would have sounded absurd two years ago: more than 80% of merged code was authored by Claude as of May 2026, and engineer output is up about 8x versus 2024. That is not a marginal productivity gain. It is a structural change in how frontier systems get built, reviewed, and shipped.

AI is already compounding inside the lab

The strongest evidence in Anthropic’s piece is not a benchmark score but the internal shift in labor. Before Claude Code launched in early 2025, Claude-authored code sat in the low single digits. By May 2026, it had crossed 80%. That means the company is no longer using AI mainly as a suggestion layer. It is using AI as a production layer, with engineers directing and reviewing rather than typing every line themselves.

That matters because compounding begins when the tool stops being a helper and starts becoming part of the machine that improves the tool itself. Anthropic says engineers are now shipping 8x as much code per quarter as they did from 2021 to 2025, and that the slope steepened again when models started working autonomously over longer horizons. Even if lines of code overstate true productivity, the direction is unmistakable: the bottleneck has shifted from writing code to supervising code generation.
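A rough way to see that shift is to split the quoted figures into machine-written and human-written volume. This is a back-of-the-envelope sketch, not Anthropic's method: it assumes the 8x multiplier and the 80% share both describe the same measure of merged code, treats the earlier baseline as 1 unit per engineer, and uses exactly 80% where the article says "more than 80%". The function name `split_output` is hypothetical.

```python
def split_output(total_multiplier: float, ai_share: float) -> tuple[float, float]:
    """Split total output (in baseline units) into AI-authored and human-authored parts.

    Assumption: the multiplier and the share refer to the same quantity
    (merged code volume), which the article does not confirm.
    """
    ai_authored = total_multiplier * ai_share
    human_authored = total_multiplier * (1.0 - ai_share)
    return ai_authored, human_authored


# Figures quoted in the article: about 8x output, about 80% authored by Claude.
ai_units, human_units = split_output(8.0, 0.80)

# Units of machine-written code an engineer must review per unit they write.
review_ratio = ai_units / human_units

print(f"AI-authored: {ai_units:.1f} units, human-authored: {human_units:.1f} units")
print(f"Review load: {review_ratio:.1f} units of AI code per unit of human code")
```

Under these assumptions, each engineer oversees roughly four units of machine-written code for every unit written by hand, which is the sense in which supervision rather than typing becomes the constraint.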

Benchmarks are telling the same story

Anthropic’s external evidence reinforces the internal data. The length of tasks that models can reliably complete on their own has been doubling roughly every four months, faster than the prior seven-month pace. The examples are concrete: Claude 3 Opus handled tasks that took humans about four minutes in March 2024, Claude 3.7 Sonnet handled about 90-minute tasks a year later, and Opus 4.6 reached 12-hour tasks a year after that. If that curve holds, days-long work enters range this year and weeks-long work in 2027.
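The doubling claim can be checked against the three quoted data points with a few lines of arithmetic. The sketch below assumes clean exponential growth between the points and exactly 12 months between them. The thresholds for "days-long" (48 hours) and "weeks-long" (168 hours) are illustrative assumptions, not definitions from Anthropic.

```python
import math


def doubling_time_months(start_minutes: float, end_minutes: float, elapsed_months: float) -> float:
    """Months per doubling implied by growth from start to end over elapsed_months."""
    doublings = math.log2(end_minutes / start_minutes)
    return elapsed_months / doublings


def months_until(current_minutes: float, target_minutes: float, doubling_months: float) -> float:
    """Months needed to grow from current to target at a fixed doubling time."""
    doublings_needed = math.log2(target_minutes / current_minutes)
    return doublings_needed * doubling_months


# Task lengths quoted in the article, in minutes of human work.
first_point = 4          # March 2024
second_point = 90        # about a year later
third_point = 12 * 60    # about a year after that

first_year = doubling_time_months(first_point, second_point, 12)
second_year = doubling_time_months(second_point, third_point, 12)

# Extrapolate from the 12-hour point at the quoted four-month doubling time.
to_two_days = months_until(third_point, 48 * 60, 4)    # assumed "days-long" threshold
to_one_week = months_until(third_point, 168 * 60, 4)   # assumed "weeks-long" threshold

print(f"Implied doubling time, first year:  {first_year:.1f} months")
print(f"Implied doubling time, second year: {second_year:.1f} months")
print(f"Months to 48-hour tasks:  {to_two_days:.1f}")
print(f"Months to 168-hour tasks: {to_one_week:.1f}")
```

On these figures the second year works out to a four-month doubling time and the first year to a faster one, so "roughly every four months" reads as a summary of the recent pace. Extrapolating at that rate puts 48-hour tasks about eight months out and week-long tasks a little over fifteen months out, consistent with the timeline in the text, provided the curve holds.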

Software and research benchmarks tell the same story. SWE-bench went from low single digits to saturation in two years. CORE-Bench, which tests whether a model can reproduce published research, rose from about 20% success in 2024 to saturation fifteen months later. These are not vanity metrics. They measure whether a system can actually execute the kinds of work that feed the next generation of models. Once models can reliably reproduce, debug, and optimize the pipeline, the distance between assistance and self-improvement narrows fast.

The counter-argument

The skeptic’s case is serious: Anthropic is still far from a model that chooses its own goals, decides which research directions matter, and redesigns itself end to end. The company says as much. Claude can already execute well-specified work and match or beat skilled humans on some experiments, but it still lags on judgment, goal selection, and open-ended prioritization. That gap is real, and it is the difference between a powerful coding agent and a system that truly closes the loop.

There is also a measurement problem. Lines of code are a crude proxy, and benchmark saturation does not equal general intelligence. A model that crushes SWE-bench can still fail on messy organizational judgment, security tradeoffs, or long-horizon strategy. The most important work in frontier AI is not just implementation. It is deciding what to build, what to test, and what to trust.

All of that is true, but it does not weaken the main conclusion. Recursive self-improvement does not need full autonomy on day one. It only needs enough capability to move more of the pipeline from humans to machines, step by step, until the machine does a larger share of the work that improves the next machine. Anthropic’s own data shows that shift is already underway. Waiting for a perfect closed loop before taking it seriously is a category error.

What to do with this

If you are an engineer, stop treating AI as a faster autocomplete and start treating it as a labor multiplier that changes review, testing, and incident response. If you are a PM, assume that task decomposition, spec writing, and evaluation design matter more every quarter. If you are a founder, build your roadmap around the fact that the cost of shipping software is falling while judging what should ship is rising in strategic importance. The winners will not be the teams that ask AI to write more code. They will be the teams that redesign the whole workflow around machine-generated work and human oversight.
