Claude Sonnet 4.6 narrows the SRE gap

OraCore Editors

[RSCH] June 30, 2026 | 6 min read | OraCore Editors


Rootly’s benchmark shows Claude Sonnet 4.6 closing much of the gap with Opus 4.6 on SRE tasks, especially incident investigations.



Rootly ran Claude Sonnet 4.6 through its SRE benchmark the same day Anthropic announced it, and the results were more nuanced than a simple scorecard. On the company’s internal incident-evaluation suite, Sonnet 4.6 tracked closely with Claude Opus 4.6 on root-cause accuracy, while costing about 40% less per token in the agentic workflow Rootly cares about most.

That matters because incident response is not a trivia contest. The model has to read logs, reason across services, follow causal chains, and decide when a symptom is a clue versus noise. Rootly’s own takeaway is that the best model for AI SRE may depend on the task, not the brand name on the model card.

Model	SRE-skills-bench score	Output cost per 1M tokens
opus-4.6	94.7%	$25.00
opus-4.5	94.6%	$25.00
sonnet-4.6	90.4%	$15.00
sonnet-4.5	85.9%	$15.00

Sonnet 4.6 made the biggest jump


The headline number from Rootly’s SRE-skills-bench is simple: Sonnet 4.6 scored 90.4%, up from 85.9% for Sonnet 4.5. That is a gain of 4.5 points at the same $15.00 per million output tokens.

Opus barely moved. Opus 4.6 scored 94.7%, just ahead of Opus 4.5 at 94.6%, and both cost $25.00 per million output tokens. Rootly’s read is that Anthropic improved the Sonnet tier more than the Opus tier in this release.

The benchmark itself is aimed at the work SREs actually do: understanding infrastructure code, reasoning about cloud configurations, and mapping code diffs to real pull requests. That makes it more useful than a generic coding benchmark for teams building incident tooling.

Sonnet 4.6: 90.4% at $15.00 per million output tokens
Sonnet 4.5: 85.9% at $15.00 per million output tokens
Opus 4.6: 94.7% at $25.00 per million output tokens
Opus 4.5: 94.6% at $25.00 per million output tokens
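The arithmetic behind these figures can be checked in a few lines. This is a minimal Python sketch using only the scores and prices quoted above; the dictionary layout and function names are illustrative, not something Rootly published.

```python
# Scores (percent) and output prices (USD per million tokens) as quoted in the article.
MODELS = {
    "sonnet-4.6": {"score": 90.4, "usd_per_m_output": 15.00},
    "sonnet-4.5": {"score": 85.9, "usd_per_m_output": 15.00},
    "opus-4.6": {"score": 94.7, "usd_per_m_output": 25.00},
    "opus-4.5": {"score": 94.6, "usd_per_m_output": 25.00},
}


def point_gain(new: str, old: str) -> float:
    """Benchmark gain of `new` over `old`, in percentage points."""
    return round(MODELS[new]["score"] - MODELS[old]["score"], 1)


def price_discount(cheaper: str, pricier: str) -> float:
    """Fractional output-price saving of `cheaper` relative to `pricier`."""
    a = MODELS[cheaper]["usd_per_m_output"]
    b = MODELS[pricier]["usd_per_m_output"]
    return round(1 - a / b, 3)


print(point_gain("sonnet-4.6", "sonnet-4.5"))    # 4.5
print(point_gain("opus-4.6", "opus-4.5"))        # 0.1
print(point_gain("opus-4.6", "sonnet-4.6"))      # 4.3
print(price_discount("sonnet-4.6", "opus-4.6"))  # 0.4
```

The last line is where the "about 40% less" figure comes from: $15.00 is 60% of $25.00.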

The gaps vary by task

Rootly broke the benchmark down by domain, and that is where the story gets more interesting. Sonnet 4.6 beat Opus 4.6 on general SRE knowledge, tied it on AWS networking, and stayed close on Kubernetes and compute. But Sonnet lost ground on IAM and S3, where policy boundaries and permission logic get much trickier.

“We experimented with our agentic workflows: investigating incidents, correlating signals, and reasoning through causal chains.” — Sylvain Kalache, Rootly

That quote gets to the point of the post. Rootly is not testing a model in isolation. It is testing how the model behaves inside an incident workflow, where the agent has to collect evidence first and reason later. In that setting, adaptive reasoning matters more than a static benchmark score.

Here is the per-task split Rootly published:

GMCQ: Sonnet 88.0%, Opus 87.0%
Azure Compute: Sonnet 92.6%, Opus 95.6%
Azure Storage: Sonnet 92.2%, Opus 96.1%
Kubernetes: Sonnet 94.5%, Opus 97.3%
AWS Compute: Sonnet 94.3%, Opus 96.6%
AWS Network: Sonnet 97.1%, Opus 97.1%
AWS IAM: Sonnet 85.2%, Opus 92.2%
AWS S3: Sonnet 75.7%, Opus 91.9%
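The per-task gaps can be computed directly from the list above. This is a small Python sketch that ranks tasks by Opus's lead; the names `TASK_SCORES` and `opus_lead` are illustrative.

```python
# Per-task scores (percent) as listed in the article: (Sonnet 4.6, Opus 4.6).
TASK_SCORES = {
    "GMCQ": (88.0, 87.0),
    "Azure Compute": (92.6, 95.6),
    "Azure Storage": (92.2, 96.1),
    "Kubernetes": (94.5, 97.3),
    "AWS Compute": (94.3, 96.6),
    "AWS Network": (97.1, 97.1),
    "AWS IAM": (85.2, 92.2),
    "AWS S3": (75.7, 91.9),
}


def opus_lead(task: str) -> float:
    """Opus score minus Sonnet score, in percentage points (negative: Sonnet leads)."""
    sonnet, opus = TASK_SCORES[task]
    return round(opus - sonnet, 1)


# Rank tasks from the widest Opus lead to the narrowest.
gaps = sorted(((opus_lead(task), task) for task in TASK_SCORES), reverse=True)
for gap, task in gaps:
    print(f"{task:14s} {gap:+.1f}")
```

Sorted this way, AWS S3 and AWS IAM sit at the top and GMCQ at the bottom, the one task where Sonnet is ahead.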

The biggest spread is in AWS S3, where Opus leads by 16.2 points. AWS IAM is next, with a 7-point gap. Those are the kinds of tasks where a routing system makes sense: send policy-heavy questions to Opus, keep broader infrastructure work on Sonnet, and cut the average cost without giving up too much accuracy.
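A routing rule of that kind could be as simple as a threshold on the observed gap. The sketch below is one possible shape, not Rootly's implementation. The domain tags, the 5-point threshold, and the choice to send unknown domains to Opus are all assumptions made for illustration.

```python
# Opus 4.6 lead over Sonnet 4.6 in percentage points, derived from the
# per-task scores quoted in the article (negative means Sonnet leads).
# The lowercase domain tags are hypothetical labels.
OPUS_LEAD_POINTS = {
    "gmcq": -1.0,
    "azure-compute": 3.0,
    "azure-storage": 3.9,
    "kubernetes": 2.8,
    "aws-compute": 2.3,
    "aws-network": 0.0,
    "aws-iam": 7.0,
    "aws-s3": 16.2,
}

SONNET = "sonnet-4.6"
OPUS = "opus-4.6"


def pick_model(domain: str, min_lead: float = 5.0) -> str:
    """Escalate to Opus when its measured lead is at least `min_lead` points.

    The 5.0 default is an arbitrary illustrative threshold. Domains with no
    benchmark data fall back to Opus as a conservative assumption.
    """
    lead = OPUS_LEAD_POINTS.get(domain.strip().lower())
    if lead is None:
        return OPUS
    return OPUS if lead >= min_lead else SONNET


for d in ("kubernetes", "aws-iam", "aws-s3", "gcp-iam"):
    print(d, "->", pick_model(d))
```

With the default threshold only IAM and S3 questions escalate. Lowering `min_lead` to 3.0 would also send Azure questions to Opus, trading cost for accuracy.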

Agentic incident work changes the picture

Rootly says the benchmark numbers do not fully capture what happens during a live incident. Its AI SRE has to pull metrics and logs, trace faults across services, and narrow the issue to a root cause before suggesting a fix. That is a longer chain of reasoning than a multiple-choice answer or a single-turn code task.

On Rootly’s internal incident suite, Sonnet 4.6 performed similarly to Opus 4.6 on root-cause accuracy, and in some cases beat it. Both models outperformed Opus 4.5 on the hardest investigations, but Sonnet 4.6 did it at about 40% lower per-token cost.

That result lines up with Anthropic’s new adaptive thinking system. The model can spend less effort while gathering evidence and more effort once it starts forming a diagnosis. For incident response, that is a good fit because the early phase is mostly retrieval and correlation, while the late phase is about deciding which failure chain actually explains the outage.
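The article describes the model adjusting its own effort. A caller-side analogue is to pick an effort level per investigation phase, sketched below in Python. The phase names and the phase-to-effort mapping are hypothetical, and how an effort level is actually passed to the model is left out on purpose; check Anthropic's API documentation for that.

```python
from enum import Enum


class Effort(str, Enum):
    """The four effort levels named in the article."""
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"
    MAX = "max"


# Hypothetical phases of an incident investigation, ordered from cheap
# retrieval work to the final diagnostic decision. The mapping is an assumption.
PHASE_EFFORT = {
    "collect_evidence": Effort.LOW,
    "correlate_signals": Effort.MEDIUM,
    "form_diagnosis": Effort.HIGH,
    "confirm_root_cause": Effort.MAX,
}


def effort_for(phase: str) -> Effort:
    """Return the effort level assigned to a phase, or raise on an unknown phase."""
    try:
        return PHASE_EFFORT[phase]
    except KeyError:
        raise ValueError(f"unknown phase: {phase!r}") from None


print(effort_for("collect_evidence").value)  # low
print(effort_for("form_diagnosis").value)    # high
```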

Rootly also points to several other Claude 4.6 features that matter for AI SRE work:

A 1M-token context window, which helps when logs and traces get long
Context compaction, which summarizes older turns during extended investigations
Improved prompt-injection resistance, useful when agents read untrusted logs and webhook payloads
Four effort levels for adaptive thinking: low, medium, high, and max
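To make the context-compaction idea concrete, here is a client-side Python sketch of the general technique: keep the most recent turns verbatim and replace everything older with one summary. It is not Anthropic's implementation. The function names are illustrative, and `naive_summary` is a stand-in for whatever summarizer a real system would use.

```python
from typing import Callable, List


def compact_history(
    turns: List[str],
    keep_recent: int,
    summarize: Callable[[List[str]], str],
) -> List[str]:
    """Replace all but the last `keep_recent` turns with a single summary turn."""
    if keep_recent < 0:
        raise ValueError("keep_recent must be non-negative")
    split = max(len(turns) - keep_recent, 0)
    older, recent = turns[:split], turns[split:]
    if not older:
        # Nothing old enough to compact; return a copy unchanged.
        return list(recent)
    return [summarize(older)] + recent


def naive_summary(older: List[str]) -> str:
    """Placeholder summarizer; a real one would condense the content."""
    return f"[summary of {len(older)} earlier turns]"


history = [f"turn {i}" for i in range(1, 7)]
print(compact_history(history, keep_recent=2, summarize=naive_summary))
# ['[summary of 4 earlier turns]', 'turn 5', 'turn 6']
```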

What this means for AI SRE teams

The practical lesson is that one model does not need to do everything. If your incident assistant handles Kubernetes triage, cloud compute questions, and broad SRE knowledge, Sonnet 4.6 looks strong enough to carry a lot of the load. If it has to reason through IAM policies or S3 permission boundaries, Opus still has a clear edge.

That suggests a routing strategy that is more like operations than model worship. Put the cheaper model on the common path, escalate the hard policy cases, and keep the expensive calls for the questions that really need them. For teams watching cloud spend, that is a cleaner tradeoff than defaulting every incident to the most expensive model.
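A rough estimate shows what such a split does to the blended price. The Python sketch below uses the article's output prices and assumes every call produces the same number of output tokens regardless of model, which real workloads may not satisfy. The 20% escalation share is a made-up example, not a Rootly figure.

```python
SONNET_USD_PER_M = 15.00  # output price quoted in the article
OPUS_USD_PER_M = 25.00    # output price quoted in the article


def blended_cost(opus_share: float) -> float:
    """Average output price per million tokens when `opus_share` of traffic goes to Opus."""
    if not 0.0 <= opus_share <= 1.0:
        raise ValueError("opus_share must be between 0 and 1")
    cost = opus_share * OPUS_USD_PER_M + (1 - opus_share) * SONNET_USD_PER_M
    return round(cost, 2)


def saving_vs_all_opus(opus_share: float) -> float:
    """Fractional saving compared with sending every call to Opus."""
    return round(1 - blended_cost(opus_share) / OPUS_USD_PER_M, 3)


print(blended_cost(0.2))        # 17.0
print(saving_vs_all_opus(0.2))  # 0.32
print(saving_vs_all_opus(0.0))  # 0.4
```

Under these assumptions, escalating one call in five to Opus still keeps most of the 40% saving.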

Rootly says it runs every frontier model through SRE-skills-bench on launch, and it publishes the leaderboard at sreskillsbench.com. That kind of public, domain-specific evaluation is useful because it rewards the thing SRE teams actually care about: fewer wrong turns during an outage.

The bigger question now is whether other incident tools will copy this split-model approach. If Sonnet 4.6 can handle the bulk of investigation work while Opus picks up the hardest policy and permission cases, AI SRE products may start to look less like a single monolithic assistant and more like a routed system with different models for different failure modes.
