[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"article-claude-sonnet-46-sre-benchmark-rootly-en":3,"article-related-claude-sonnet-46-sre-benchmark-rootly-en":30,"series-research-e4fcd8f3-1391-4ef3-b44d-1aab77b30fca":73},{"id":4,"slug":5,"title":6,"content":7,"summary":8,"source":9,"source_url":10,"author":11,"image_url":12,"cover_image":12,"category":13,"language":14,"translated_content":11,"related_article_id":15,"keywords":16,"key_takeaways":22,"views":26,"created_at":27,"published_at":28,"topic_cluster_id":29},"e4fcd8f3-1391-4ef3-b44d-1aab77b30fca","claude-sonnet-46-sre-benchmark-rootly-en","Claude Sonnet 4.6 narrows the SRE gap","\u003Cp data-speakable=\"summary\">Rootly found \u003Ca href=\"\u002Ftag\u002Fclaude\">Claude\u003C\u002Fa> Sonnet 4.6 nearly matches Opus 4.6 on incident investigations.\u003C\u002Fp>\u003Cp>\u003Ca href=\"https:\u002F\u002Frootly.com\" target=\"_blank\" rel=\"noopener\">Rootly\u003C\u002Fa> ran \u003Ca href=\"https:\u002F\u002Fwww.anthropic.com\u002Fnews\u002Fclaude-4-6\" target=\"_blank\" rel=\"noopener\">Claude Sonnet 4.6\u003C\u002Fa> through its SRE \u003Ca href=\"\u002Ftag\u002Fbenchmark\">benchmark\u003C\u002Fa> the same day \u003Ca href=\"\u002Ftag\u002Fanthropic\">Anthropic\u003C\u002Fa> announced it, and the results were more nuanced than a simple scorecard. On the company’s internal incident-evaluation suite, Sonnet 4.6 tracked closely with \u003Ca href=\"https:\u002F\u002Fwww.anthropic.com\u002Fnews\u002Fclaude-4-6\" target=\"_blank\" rel=\"noopener\">Claude Opus 4.6\u003C\u002Fa> on root-cause accuracy, while costing about 40% less per \u003Ca href=\"\u002Ftag\u002Ftoken\">token\u003C\u002Fa> in the agentic workflow Rootly cares about most.\u003C\u002Fp>\u003Cp>That matters because incident response is not a trivia contest. The model has to read logs, reason across services, follow causal chains, and decide when a symptom is a clue versus noise. 
Rootly’s own takeaway is that the best model for AI SRE may depend on the task, not the brand name on the model card.\u003C\u002Fp>\u003Ctable>\u003Cthead>\u003Ctr>\u003Cth>Model\u003C\u002Fth>\u003Cth>SRE-skills-bench\u003C\u002Fth>\u003Cth>Output cost per M\u003C\u002Fth>\u003C\u002Ftr>\u003C\u002Fthead>\u003Ctbody>\u003Ctr>\u003Ctd>opus-4.6\u003C\u002Ftd>\u003Ctd>94.7%\u003C\u002Ftd>\u003Ctd>$25.00\u003C\u002Ftd>\u003C\u002Ftr>\u003Ctr>\u003Ctd>opus-4.5\u003C\u002Ftd>\u003Ctd>94.6%\u003C\u002Ftd>\u003Ctd>$25.00\u003C\u002Ftd>\u003C\u002Ftr>\u003Ctr>\u003Ctd>sonnet-4.6\u003C\u002Ftd>\u003Ctd>90.4%\u003C\u002Ftd>\u003Ctd>$15.00\u003C\u002Ftd>\u003C\u002Ftr>\u003Ctr>\u003Ctd>sonnet-4.5\u003C\u002Ftd>\u003Ctd>85.9%\u003C\u002Ftd>\u003Ctd>$15.00\u003C\u002Ftd>\u003C\u002Ftr>\u003C\u002Ftbody>\u003C\u002Ftable>\u003Ch2>Sonnet 4.6 made the biggest jump\u003C\u002Fh2>\u003Cp>The headline number from Rootly’s \u003Ca href=\"https:\u002F\u002Fsreskillsbench.com\" target=\"_blank\" rel=\"noopener\">SRE-skills-bench\u003C\u002Fa> is simple: Sonnet 4.6 scored 90.4%, up from 85.9% for Sonnet 4.5. That is a gain of 4.5 points at the same $15.00 per million output tokens.\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782750772754-cmvk.png\" alt=\"Claude Sonnet 4.6 narrows the SRE gap\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cp>Opus barely moved. Opus 4.6 scored 94.7%, just ahead of Opus 4.5 at 94.6%, and both cost $25.00 per million output tokens. Rootly’s read is that Anthropic improved the Sonnet tier more than the Opus tier in this release.\u003C\u002Fp>\u003Cp>The benchmark itself is aimed at the work SREs actually do: understanding infrastructure code, reasoning about cloud configurations, and mapping code diffs to real pull requests. 
That makes it more useful than a generic \u003Ca href=\"\u002Fnews\u002Fkimi-2-7-price-coding-benchmark-en\">coding benchmark\u003C\u002Fa> for teams building incident tooling.\u003C\u002Fp>\u003Cul>\u003Cli>Sonnet 4.6: 90.4% at $15.00 per million output tokens\u003C\u002Fli>\u003Cli>Sonnet 4.5: 85.9% at $15.00 per million output tokens\u003C\u002Fli>\u003Cli>Opus 4.6: 94.7% at $25.00 per million output tokens\u003C\u002Fli>\u003Cli>Opus 4.5: 94.6% at $25.00 per million output tokens\u003C\u002Fli>\u003C\u002Ful>\u003Ch2>The gaps vary by task\u003C\u002Fh2>\u003Cp>Rootly broke the benchmark down by domain, and that is where the story gets more interesting. Sonnet 4.6 beat Opus 4.6 on general SRE knowledge, tied it on AWS networking, and stayed close on Kubernetes and compute. But the model lost ground on IAM and S3, where policy boundaries and permission logic get much trickier.\u003C\u002Fp>\u003Cblockquote>“We experimented with our agentic workflows: investigating incidents, correlating signals, and reasoning through causal chains.” — Sylvain Kalache, Rootly\u003C\u002Fblockquote>\u003Cp>That quote gets to the point of the post. Rootly is not testing a model in isolation. It is testing how the model behaves inside an incident workflow, where the \u003Ca href=\"\u002Ftag\u002Fagent\">agent\u003C\u002Fa> has to collect evidence first and reason later. 
In that setting, adaptive reasoning matters more than a static benchmark score.\u003C\u002Fp>\u003Cp>Here is the per-task split Rootly published:\u003C\u002Fp>\u003Cul>\u003Cli>GMCQ: Sonnet 88.0%, Opus 87.0%\u003C\u002Fli>\u003Cli>Azure Compute: Sonnet 92.6%, Opus 95.6%\u003C\u002Fli>\u003Cli>Azure Storage: Sonnet 92.2%, Opus 96.1%\u003C\u002Fli>\u003Cli>Kubernetes: Sonnet 94.5%, Opus 97.3%\u003C\u002Fli>\u003Cli>AWS Compute: Sonnet 94.3%, Opus 96.6%\u003C\u002Fli>\u003Cli>AWS Network: Sonnet 97.1%, Opus 97.1%\u003C\u002Fli>\u003Cli>AWS IAM: Sonnet 85.2%, Opus 92.2%\u003C\u002Fli>\u003Cli>AWS S3: Sonnet 75.7%, Opus 91.9%\u003C\u002Fli>\u003C\u002Ful>\u003Cp>The biggest spread is in AWS S3, where Opus leads by 16.2 points. AWS IAM is next, with a 7-point gap. Those are the kinds of tasks where a routing system makes sense: send policy-heavy questions to Opus, keep broader infrastructure work on Sonnet, and cut the average cost without giving up too much accuracy.\u003C\u002Fp>\u003Ch2>Agentic incident work changes the picture\u003C\u002Fh2>\u003Cp>Rootly says the benchmark numbers do not fully capture what happens during a live incident. Its AI SRE has to pull metrics and logs, trace faults across services, and narrow the issue to a root cause before suggesting a fix. That is a longer chain of reasoning than a multiple-choice answer or a single-turn code task.\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782750772129-xkv5.png\" alt=\"Claude Sonnet 4.6 narrows the SRE gap\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cp>On Rootly’s internal incident suite, Sonnet 4.6 performed similarly to Opus 4.6 on root-cause accuracy, and in some cases beat it. 
Both models outperformed Opus 4.5 on the hardest investigations, but Sonnet 4.6 did it at about 40% lower per-token cost.\u003C\u002Fp>\u003Cp>That result lines up with Anthropic’s new \u003Ca href=\"https:\u002F\u002Fwww.anthropic.com\u002Fnews\u002Fclaude-4-6\" target=\"_blank\" rel=\"noopener\">adaptive thinking\u003C\u002Fa> system. The model can spend less effort while gathering evidence and more effort once it starts forming a diagnosis. For incident response, that is a good fit because the early phase is mostly retrieval and correlation, while the late phase is about deciding which failure chain actually explains the outage.\u003C\u002Fp>\u003Cp>Rootly also points to four other Claude 4.6 features that matter for AI SRE work:\u003C\u002Fp>\u003Cul>\u003Cli>A 1M-token context window, which helps when logs and traces get long\u003C\u002Fli>\u003Cli>Context compaction, which summarizes older turns during extended investigations\u003C\u002Fli>\u003Cli>Improved prompt-injection resistance, useful when agents read untrusted logs and webhook payloads\u003C\u002Fli>\u003Cli>Four effort levels for adaptive thinking: low, medium, high, and max\u003C\u002Fli>\u003C\u002Ful>\u003Ch2>What this means for AI SRE teams\u003C\u002Fh2>\u003Cp>The practical lesson is that one model does not need to do everything. If your incident assistant handles Kubernetes triage, cloud compute questions, and broad SRE knowledge, Sonnet 4.6 looks strong enough to carry a lot of the load. If it has to reason through IAM policies or S3 permission boundaries, Opus still has a clear edge.\u003C\u002Fp>\u003Cp>That suggests a routing strategy that is more like operations than model worship. Put the cheaper model on the common path, escalate the hard policy cases, and keep the expensive calls for the questions that really need them. 
For teams watching cloud spend, that is a cleaner tradeoff than defaulting every incident to the most expensive model.\u003C\u002Fp>\u003Cp>Rootly says it runs every frontier model through SRE-skills-bench on launch, and it publishes the leaderboard at \u003Ca href=\"https:\u002F\u002Fsreskillsbench.com\" target=\"_blank\" rel=\"noopener\">sreskillsbench.com\u003C\u002Fa>. That kind of public, domain-specific evaluation is useful because it rewards the thing SRE teams actually care about: fewer wrong turns during an outage.\u003C\u002Fp>\u003Cp>The bigger question now is whether other incident tools will copy this split-model approach. If Sonnet 4.6 can handle the bulk of investigation work while Opus picks up the hardest policy and permission cases, AI SRE products may start to look less like a single monolithic assistant and more like a routed system with different models for different failure modes.\u003C\u002Fp>","Rootly’s benchmark shows Claude Sonnet 4.6 closing much of the gap with Opus 4.6 on SRE tasks, especially incident investigations.","rootly.com","https:\u002F\u002Frootly.com\u002Fblog\u002Fclaude-sonnet-4-6-benchmark-results-and-lessons-for-ai-sre",null,"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782750772754-cmvk.png","research","en","d6f25c66-98f5-4971-8d1d-487fb5fe1881",[17,18,19,20,21],"Claude Sonnet 4.6","Rootly","AI SRE","incident response","SRE benchmark",[23,24,25],"Sonnet 4.6 jumped to 90.4% on Rootly’s SRE benchmark, up 4.5 points from Sonnet 4.5.","Opus 4.6 still leads on IAM and S3, but Sonnet 4.6 closes much of the gap on broader SRE tasks.","Rootly’s incident suite suggests adaptive thinking matters more than static benchmark scores for AI 
SRE.",0,"2026-06-29T16:32:28.970805+00:00","2026-06-29T16:32:28.957+00:00","3a949a81-75cc-4a29-a9ce-24903ce51366",{"tags":31,"relatedLang":32,"relatedPosts":36},[],{"id":15,"slug":33,"title":34,"language":35},"claude-sonnet-46-sre-benchmark-rootly-zh","Claude Sonnet 4.6 對上 SRE 工作更接近 Opus","zh",[37,43,49,55,61,67],{"id":38,"slug":39,"title":40,"cover_image":41,"image_url":41,"created_at":42,"category":13},"ab888d55-3985-46f0-b026-5a5101541cdf","glm-52-beats-claude-semgrep-idor-test-en","GLM 5.2 beats Claude in Semgrep’s IDOR test","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782749876047-ciry.png","2026-06-29T16:17:32.406761+00:00",{"id":44,"slug":45,"title":46,"cover_image":47,"image_url":47,"created_at":48,"category":13},"3efb3e20-b2da-4abd-b442-3babd8b0ed1e","opd-distillation-skills-without-bruteforce-rl-en","OPD lets you distill skills without brute-force RL","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782730111097-6brq.png","2026-06-29T10:47:57.980973+00:00",{"id":50,"slug":51,"title":52,"cover_image":53,"image_url":53,"created_at":54,"category":13},"f3edd37b-2524-4d6d-b411-7ca0cce9eff0","google-deepmind-turns-science-into-tools-en","Google DeepMind turns science into tools","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782721105101-d4rm.png","2026-06-29T08:17:58.280652+00:00",{"id":56,"slug":57,"title":58,"cover_image":59,"image_url":59,"created_at":60,"category":13},"c522f9af-2862-4f1c-bbf9-99bc20c78544","measuring-llm-behavior-portability-en","Measuring when LLM behavior actually 
transfers","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782717476648-9gjo.png","2026-06-29T07:17:30.115953+00:00",{"id":62,"slug":63,"title":64,"cover_image":65,"image_url":65,"created_at":66,"category":13},"1a5d9d4d-4e21-4860-84b0-9b209ca4d7f5","prompt-injection-ai-security-problem-en","Prompt injection is now an AI security problem","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782716584463-r1ei.png","2026-06-29T07:02:36.642691+00:00",{"id":68,"slug":69,"title":70,"cover_image":71,"image_url":71,"created_at":72,"category":13},"fba917c8-939c-4457-a90e-4012d9a692df","solver-choice-nash-equilibrium-selection-en","Solver choice changes which Nash equilibrium wins","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782714784738-e4dj.png","2026-06-29T06:32:31.603116+00:00",[74,79,84,89,94,99,104,109,114,119],{"id":75,"slug":76,"title":77,"created_at":78},"a2715e72-1fe8-41b3-abb1-d0cf1f710189","ai-predictions-2026-big-changes-en","AI Predictions for 2026: Brace for Big Changes","2026-03-26T01:25:07.788356+00:00",{"id":80,"slug":81,"title":82,"created_at":83},"8404bd7b-4c2f-4109-9ec4-baf29d88af2b","ml-papers-of-the-week-github-research-desk-en","ML Papers of the Week Turns GitHub Into a Research Desk","2026-03-27T01:11:39.480259+00:00",{"id":85,"slug":86,"title":87,"created_at":88},"87897a94-8065-4464-a016-1f23e89e17cc","ai-ml-conferences-to-watch-in-2026-en","AI\u002FML Conferences to Watch in 2026","2026-03-27T01:51:54.184108+00:00",{"id":90,"slug":91,"title":92,"created_at":93},"6f1987cf-25f3-47a4-b3e6-db0997695be8","openclaw-agents-manipulated-self-sabotage-en","OpenClaw Agents Can Be Manipulated Into 
Failure","2026-03-28T03:03:18.899465+00:00",{"id":95,"slug":96,"title":97,"created_at":98},"a53571ad-735a-4178-9f93-cb09b699d99c","vega-driving-language-instructions-en","Vega: Driving with Natural Language Instructions","2026-03-28T14:54:04.698882+00:00",{"id":100,"slug":101,"title":102,"created_at":103},"a34581d6-f36e-46da-88bb-582fb3e7425c","personalizing-autonomous-driving-styles-en","Drive My Way: Personalizing Autonomous Driving Styles","2026-03-28T14:54:26.148181+00:00",{"id":105,"slug":106,"title":107,"created_at":108},"2bc1ad7f-26ce-4f02-9885-803b35fd229d","training-knowledge-bases-writeback-rag-en","Training Knowledge Bases with WriteBack-RAG","2026-03-28T14:54:45.643433+00:00",{"id":110,"slug":111,"title":112,"created_at":113},"71adc507-3c54-4605-bbe2-c966acd6187e","packforcing-long-video-generation-en","PackForcing: Efficient Long-Video Generation Method","2026-03-28T14:55:02.646943+00:00",{"id":115,"slug":116,"title":117,"created_at":118},"675942ef-b9ec-4c5f-a997-381250b6eacb","pixelsmile-facial-expression-editing-en","PixelSmile Framework Enhances Facial Expression Editing","2026-03-28T14:55:20.633463+00:00",{"id":120,"slug":121,"title":122,"created_at":123},"6954fa2b-8b66-4839-884b-e46f89fa1bc3","adaptive-block-scaled-data-types-en","IF4: Smarter 4-Bit Quantization That Adapts to Your Data","2026-03-31T06:00:36.65963+00:00"]