[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"tag-swe-bench":3},{"tag":4,"articles":11,"peer_article_count":110},{"id":5,"name":6,"slug":7,"article_count":8,"description_zh":9,"description_en":10},"7f2f1f94-2fd6-4985-9136-b9715dbf8f06","SWE-Bench","swe-bench",11,"SWE-bench 是用真實 GitHub issue 評估程式修復能力的基準，常分成 Verified、Lite 等版本。它反映模型與 agent 是否能讀懂程式庫、定位 bug、修改測試並維持可重現性，也常被用來比較 coding agent 的成本與效率。","SWE-bench is a benchmark for measuring whether models and coding agents can fix real GitHub issues end to end. Its variants, including Verified and Lite, are used to compare bug localization, test-driven edits, and the cost of agentic repair workflows.",[12,21,29,37,44,51,58,65,73,80,88,95,102],{"id":13,"slug":14,"title":15,"summary":16,"category":17,"image_url":18,"cover_image":18,"language":19,"created_at":20},"2c317df8-4070-4c74-bab5-48f79fe2860e","claude-vs-gpt-vs-gemini-coding-benchmark-leaderboard-en","Claude vs GPT vs Gemini: Coding Benchmark Leaderboard","A June 2026 coding benchmark comparison of Claude, GPT, and Gemini for model buyers.","industry","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781939876788-ivgw.png","en","2026-06-20T07:17:35.473285+00:00",{"id":22,"slug":23,"title":24,"summary":25,"category":26,"image_url":27,"cover_image":27,"language":19,"created_at":28},"e891adc0-af64-41c7-bb41-d75e6506d388","ai-benchmarks-2026-evaluations-limits-en","AI Benchmarks 2026: Top Evaluations and Limits","MMLU, HLE, SWE-Bench and agent tests are hitting limits in 2026, while production gaps and contamination keep human review necessary.","research","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781381870944-h208.png","2026-06-13T20:17:26.361723+00:00",{"id":30,"slug":31,"title":32,"summary":33,"category":34,"image_url":35,"cover_image":35,"language":19,"created_at":36},"e7f37851-7b5f-429c-9a71-3e4a2d4b9c70","mimo-v2-flash-openrouter-benchmarks-pricing-en","MiMo-V2-Flash hits top open-source SWE-bench scores","Xiaomi’s MiMo-V2-Flash tops open-source SWE-bench scores while OpenRouter lists it at $0.10\u002F$0.30 per 1M tokens.","model-release","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781321563162-27yb.png","2026-06-13T03:32:17.731154+00:00",{"id":38,"slug":39,"title":40,"summary":41,"category":26,"image_url":42,"cover_image":42,"language":19,"created_at":43},"d389cb06-cef8-48a6-abfc-0c5f5bcb6a26","anthropic-ai-building-ai-recursive-self-improvement-en","Anthropic’s own data says AI is already building AI","Anthropic’s data shows AI is already accelerating AI development, and that should alarm every serious builder.","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781257684774-rwor.png","2026-06-12T09:47:25.328276+00:00",{"id":45,"slug":46,"title":47,"summary":48,"category":34,"image_url":49,"cover_image":49,"language":19,"created_at":50},"32fa153a-374d-415e-871d-8d0bfad55c03","kimi-k2-6-complete-guide-2026-en","Kimi K2.6: What Changed in 2026","Kimi K2.6 is Moonshot AI’s open-weights flagship, with agent swarms, INT4 weights, and top-tier coding scores.","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778981035996-gige.png","2026-05-17T01:23:32.766474+00:00",{"id":52,"slug":53,"title":54,"summary":55,"category":34,"image_url":56,"cover_image":56,"language":19,"created_at":57},"cb6097c9-9b15-4ff5-860d-5d1b172035db","kimi-k26-qwen-36-open-source-frontier-gap-en","Kimi K2.6 and Qwen 3.6 Narrow the Gap","Kimi K2.6 and Qwen 3.6 are open-weight models that now rival closed models on coding and agent tasks.","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1777901475047-q3mm.png","2026-05-04T13:30:41.54989+00:00",{"id":59,"slug":60,"title":61,"summary":62,"category":26,"image_url":63,"cover_image":63,"language":19,"created_at":64},"904270f5-c35d-4938-915f-99b405511466","ai-agents-token-spending-coding-tasks-en","How AI Agents Spend Your Money: 1000x Tokens on SWE-bench","A study of SWE-bench Verified shows agentic coding can consume 1000x more tokens than chat, with costs driven by inputs and hard to predict.","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1777270014409-t6t5.png","2026-04-27T06:06:38.046891+00:00",{"id":66,"slug":67,"title":68,"summary":69,"category":34,"image_url":70,"cover_image":71,"language":19,"created_at":72},"674cce69-5be8-4c32-bfbd-32ab6fd2fcd7","qwen36-27b-open-source-coding-model-en","Qwen3.6-27B opens a smaller, sharper path to coding","Qwen3.6-27B is a 27B dense multimodal model that beats Qwen3.5-397B-A17B on key coding benchmarks while staying easier to deploy.","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1777260618061-cpw4.png","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Fskill-cover-qwen36-27b-en_-1777263004.png","2026-04-27T00:12:39.968514+00:00",{"id":74,"slug":75,"title":76,"summary":77,"category":34,"image_url":78,"cover_image":78,"language":19,"created_at":79},"993f67fa-c342-4b67-b7f6-144efc0a0eca","claude-mythos-preview-beats-gpt-54-gemini-benchmarks-en","Claude Mythos Preview Tops GPT-5.4 on Key Benchmarks","Anthropic’s unreleased Mythos Preview beats GPT-5.4 and Gemini 3.1 Pro on coding, math, and agent tests, led by 97.6% on USAMO.","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1776082017256-9j4y.png","2026-04-13T12:06:36.377043+00:00",{"id":81,"slug":82,"title":83,"summary":84,"category":85,"image_url":86,"cover_image":86,"language":19,"created_at":87},"1a496462-2097-4efc-9a2b-17e192da4c86","tested-devin-10-tasks-finished-3-en","I Tested Devin on 10 Tasks. It Finished 3.","Devin scored 13.86% on SWE-bench and finished 3 of 10 real tasks in one test, showing where AI coding agents still fall short.","ai-agent","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1775167972790-r3zi.png","2026-04-02T22:12:37.642077+00:00",{"id":89,"slug":90,"title":91,"summary":92,"category":34,"image_url":93,"cover_image":93,"language":19,"created_at":94},"04e78fe1-7f49-40db-bfb2-7bb4b3579276","gemini-3-1-pro-googles-top-model-in-numbers-en","Gemini 3.1 Pro: Google’s new top model in numbers","Gemini 3.1 Pro posts 77.1% on ARC-AGI-2, 94.3% on GPQA Diamond, and a 1M-token context window, while keeping Gemini 3 pricing.","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1775153582956-qese.png","2026-04-02T18:12:42.161483+00:00",{"id":96,"slug":97,"title":98,"summary":99,"category":34,"image_url":100,"cover_image":100,"language":19,"created_at":101},"91fe9555-c2db-4489-babe-df23943ec39b","glm-5-zai-flagship-coding-agents-en","GLM-5: Z.AI's new flagship for coding and agents","GLM-5 posts 77.8 on SWE-bench Verified and 56.2 on Terminal Bench 2.0, putting Z.AI in direct competition with top coding models.","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1775135076803-ig5q.png","2026-04-02T13:03:42.827978+00:00",{"id":103,"slug":104,"title":105,"summary":106,"category":34,"image_url":107,"cover_image":108,"language":19,"created_at":109},"f063d8d1-41d1-4de4-8ebc-6c40511b9369","xiaomi-mimo-v2-pro-1t-moe-agents-en","Xiaomi MiMo-V2-Pro: 1T MoE Model for Agents","Xiaomi’s MiMo-V2-Pro packs 1T parameters, 42B active, and 1M context, with SWE-bench results close to Claude Sonnet 4.6.",null,"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1774619185536-iewn.png","2026-03-28T03:06:19.238032+00:00",7]