[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"article-rootly-benchmark-llama-4-trails-coding-models-en":3,"article-related-rootly-benchmark-llama-4-trails-coding-models-en":31,"series-research-354441d5-652c-4658-a446-14f101f5e084":79},{"id":4,"slug":5,"title":6,"content":7,"summary":8,"source":9,"source_url":10,"author":11,"image_url":12,"cover_image":12,"category":13,"language":14,"translated_content":11,"related_article_id":15,"keywords":16,"key_takeaways":23,"views":27,"created_at":28,"published_at":29,"topic_cluster_id":30},"354441d5-652c-4658-a446-14f101f5e084","rootly-benchmark-llama-4-trails-coding-models-en","Rootly benchmark: Llama 4 trails coding models","\u003Cp data-speakable=\"summary\">Rootly AI Labs says Llama 4 lagged coding-focused models on a Mastodon \u003Ca href=\"\u002Ftag\u002Fgithub\">GitHub\u003C\u002Fa> \u003Ca href=\"\u002Ftag\u002Fbenchmark\">benchmark\u003C\u002Fa>.\u003C\u002Fp>\u003Cp>\u003Ca href=\"https:\u002F\u002Frootly.com\" target=\"_blank\" rel=\"noopener\">Rootly\u003C\u002Fa> says its AI Labs benchmark found \u003Ca href=\"https:\u002F\u002Fai.meta.com\u002Fllama\u002F\" target=\"_blank\" rel=\"noopener\">Llama 4\u003C\u002Fa> underperformed on coding tasks, even versus its older sibling, Llama 3.3. 
The test, published April 11, 2025, used 100 Mastodon GitHub bug issues and asked models to pick the correct pull request from four choices.\u003C\u002Fp>\u003Ctable>\u003Cthead>\u003Ctr>\u003Cth>Metric\u003C\u002Fth>\u003Cth>Value\u003C\u002Fth>\u003C\u002Ftr>\u003C\u002Fthead>\u003Ctbody>\u003Ctr>\u003Ctd>Benchmark size\u003C\u002Ftd>\u003Ctd>100 GitHub bug issues\u003C\u002Ftd>\u003C\u002Ftr>\u003Ctr>\u003Ctd>Llama 4 Maverick accuracy\u003C\u002Ftd>\u003Ctd>70%\u003C\u002Ftd>\u003C\u002Ftr>\u003Ctr>\u003Ctd>Llama 4 overall accuracy\u003C\u002Ftd>\u003Ctd>69.5%\u003C\u002Ftd>\u003C\u002Ftr>\u003Ctr>\u003Ctd>DeepSeek v3.1 gap\u003C\u002Ftd>\u003Ctd>6 percentage points ahead of Llama 4\u003C\u002Ftd>\u003C\u002Ftr>\u003Ctr>\u003Ctd>GPT-4o gap\u003C\u002Ftd>\u003Ctd>18 percentage points ahead of Llama 4\u003C\u002Ftd>\u003C\u002Ftr>\u003Ctr>\u003Ctd>Qwen2.5-Coder-32B accuracy\u003C\u002Ftd>\u003Ctd>About 90%\u003C\u002Ftd>\u003C\u002Ftr>\u003C\u002Ftbody>\u003C\u002Ftable>\u003Ch2>What changed\u003C\u002Fh2>\u003Cp>Rootly AI Labs compared Llama 4 Scout, Maverick, and Behemoth against both general multimodal models and coding-tuned systems. 
The team says it could not reproduce \u003Ca href=\"\u002Ftag\u002Fmeta\">Meta\u003C\u002Fa>’s claim that Llama 4 beats \u003Ca href=\"https:\u002F\u002Fopenai.com\u002Findex\u002Fgpt-4o\u002F\" target=\"_blank\" rel=\"noopener\">GPT-4o\u003C\u002Fa>, \u003Ca href=\"https:\u002F\u002Fdeepmind.google\u002Ftechnologies\u002Fgemini\u002F\" target=\"_blank\" rel=\"noopener\">Gemini 2.0 Flash\u003C\u002Fa>, and \u003Ca href=\"https:\u002F\u002Fwww.deepseek.com\u002F\" target=\"_blank\" rel=\"noopener\">DeepSeek\u003C\u002Fa> v3.1 on reasoning and coding.\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782086567786-wz4t.png\" alt=\"Rootly benchmark: Llama 4 trails coding models\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cp>The benchmark setup was simple: each model saw a bug report plus four candidate PRs, with one correct match. No codebase context was included. Rootly says that made the task closer to a real triage workflow than a broad academic benchmark.\u003C\u002Fp>\u003Cul>\u003Cli>Llama 4 came last in Rootly’s accuracy ranking at 69.5%.\u003C\u002Fli>\u003Cli>Llama 3.3 70B-Versatile scored 72%, edging out Llama 4.\u003C\u002Fli>\u003Cli>DeepSeek v3.1 beat Llama 4 by 6 percentage points.\u003C\u002Fli>\u003Cli>GPT-4o led Llama 4 by 18 percentage points.\u003C\u002Fli>\u003Cli>\u003Ca href=\"https:\u002F\u002Fwww.aliyun.com\u002Fen\u002Fproduct\u002Fai\u002Fqwen\" target=\"_blank\" rel=\"noopener\">Qwen2.5-Coder-32B\u003C\u002Fa> and OpenAI o3-mini landed near 90%.\u003C\u002Fli>\u003C\u002Ful>\u003Ch2>Why it matters\u003C\u002Fh2>\u003Cp>For developers, the result is a reminder that benchmark headlines can hide task-specific gaps. 
A model that looks strong on general tests may still miss the mark on code triage, bug fixing, or incident response workflows.\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782086563813-k6bj.png\" alt=\"Rootly benchmark: Llama 4 trails coding models\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cp>For teams choosing an \u003Ca href=\"\u002Ftag\u002Fllm\">LLM\u003C\u002Fa>, the practical takeaway is narrower: if the job is coding help, Rootly’s data points to specialized models such as \u003Ca href=\"\u002Ftag\u002Fqwen\">Qwen\u003C\u002Fa>2.5-Coder-32B or o3-mini rather than a general-purpose release like Llama 4.\u003C\u002Fp>\u003Cp>Rootly says the dataset is open source and the test set is small, so the numbers are not the final word on model quality. The sharper question is whether Llama 4’s architecture helps in broad chat tasks more than in the coding work developers actually need.\u003C\u002Fp>","Rootly AI Labs says Llama 4 lagged coding-focused models on a Mastodon GitHub benchmark, with GPT-4o and Qwen2.5-Coder ahead.","rootly.com","https:\u002F\u002Frootly.com\u002Fblog\u002Fllama-4-underperforms-a-benchmark-against-coding-centric-models",null,"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782086567786-wz4t.png","research","en","10c48be8-a5e6-4153-87d3-573dd4b2aec4",[17,18,19,20,21,22],"Llama 4","benchmark","coding models","Rootly AI Labs","GPT-4o","Qwen2.5-Coder",[24,25,26],"Rootly AI Labs says Llama 4 lagged on a 100-issue coding benchmark.","Llama 4 scored 69.5%, behind Llama 3.3, DeepSeek v3.1, and GPT-4o.","Specialized coding models like Qwen2.5-Coder-32B and o3-mini ranked near 
90%.",0,"2026-06-22T00:02:22.751682+00:00","2026-06-22T00:02:22.744+00:00","3a949a81-75cc-4a29-a9ce-24903ce51366",{"tags":32,"relatedLang":38,"relatedPosts":42},[33,35,37],{"name":17,"slug":34},"llama-4",{"name":21,"slug":36},"gpt-4o",{"name":18,"slug":18},{"id":15,"slug":39,"title":40,"language":41},"rootly-benchmark-llama-4-trails-coding-models-zh","Rootly 測試：Llama 4 落後編碼模型","zh",[43,49,55,61,67,73],{"id":44,"slug":45,"title":46,"cover_image":47,"image_url":47,"created_at":48,"category":13},"569999a1-0afb-46a6-929a-2c9089682668","8tai-jiqiren-ziji-zuo-shiyan-en","8台机器人怎么自己做实验","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782073087231-rcfn.png","2026-06-21T20:17:41.340146+00:00",{"id":50,"slug":51,"title":52,"cover_image":53,"image_url":53,"created_at":54,"category":13},"8cdb1cdd-1014-4c4c-9ea3-63dc78301524","xtragpt-paper-revision-human-ai-collaboration-en","XtraGPT lets you revise papers with control","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782066795284-78ju.png","2026-06-21T18:32:49.655317+00:00",{"id":56,"slug":57,"title":58,"cover_image":59,"image_url":59,"created_at":60,"category":13},"40d7a637-c770-47b5-8813-fd56a798b332","skill-to-lora-cuts-agent-token-overhead-en","Skill-to-LoRA cuts agent token overhead","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781993875444-u3le.png","2026-06-20T22:17:31.1477+00:00",{"id":62,"slug":63,"title":64,"cover_image":65,"image_url":65,"created_at":66,"category":13},"405de39d-cfc5-43bf-b47b-ff9ce7be96a9","turboquant-does-not-hurt-search-quality-equal-bytes-en","TurboQuant does not hurt search quality at equal byte 
budgets","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781857967113-2xax.png","2026-06-19T08:32:22.235692+00:00",{"id":68,"slug":69,"title":70,"cover_image":71,"image_url":71,"created_at":72,"category":13},"66286461-18c3-42a2-a053-16a87b9a0dd0","deterministic-multicalibration-optimal-sample-use-en","Deterministic multicalibration finally hits optimal sample use","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781850768283-gcmj.png","2026-06-19T06:32:28.768728+00:00",{"id":74,"slug":75,"title":76,"cover_image":77,"image_url":77,"created_at":78,"category":13},"6dc0410b-c9ec-4148-974b-0b5f7a14975c","uniego-proxy-teachers-egocentric-video-en","UNIEGO unifies egocentric video with proxy teachers","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781849887430-g735.png","2026-06-19T06:17:32.327109+00:00",[80,85,90,95,100,105,110,115,120,125],{"id":81,"slug":82,"title":83,"created_at":84},"a2715e72-1fe8-41b3-abb1-d0cf1f710189","ai-predictions-2026-big-changes-en","AI Predictions for 2026: Brace for Big Changes","2026-03-26T01:25:07.788356+00:00",{"id":86,"slug":87,"title":88,"created_at":89},"8404bd7b-4c2f-4109-9ec4-baf29d88af2b","ml-papers-of-the-week-github-research-desk-en","ML Papers of the Week Turns GitHub Into a Research Desk","2026-03-27T01:11:39.480259+00:00",{"id":91,"slug":92,"title":93,"created_at":94},"87897a94-8065-4464-a016-1f23e89e17cc","ai-ml-conferences-to-watch-in-2026-en","AI\u002FML Conferences to Watch in 2026","2026-03-27T01:51:54.184108+00:00",{"id":96,"slug":97,"title":98,"created_at":99},"6f1987cf-25f3-47a4-b3e6-db0997695be8","openclaw-agents-manipulated-self-sabotage-en","OpenClaw Agents Can Be Manipulated Into 
Failure","2026-03-28T03:03:18.899465+00:00",{"id":101,"slug":102,"title":103,"created_at":104},"a53571ad-735a-4178-9f93-cb09b699d99c","vega-driving-language-instructions-en","Vega: Driving with Natural Language Instructions","2026-03-28T14:54:04.698882+00:00",{"id":106,"slug":107,"title":108,"created_at":109},"a34581d6-f36e-46da-88bb-582fb3e7425c","personalizing-autonomous-driving-styles-en","Drive My Way: Personalizing Autonomous Driving Styles","2026-03-28T14:54:26.148181+00:00",{"id":111,"slug":112,"title":113,"created_at":114},"2bc1ad7f-26ce-4f02-9885-803b35fd229d","training-knowledge-bases-writeback-rag-en","Training Knowledge Bases with WriteBack-RAG","2026-03-28T14:54:45.643433+00:00",{"id":116,"slug":117,"title":118,"created_at":119},"71adc507-3c54-4605-bbe2-c966acd6187e","packforcing-long-video-generation-en","PackForcing: Efficient Long-Video Generation Method","2026-03-28T14:55:02.646943+00:00",{"id":121,"slug":122,"title":123,"created_at":124},"675942ef-b9ec-4c5f-a997-381250b6eacb","pixelsmile-facial-expression-editing-en","PixelSmile Framework Enhances Facial Expression Editing","2026-03-28T14:55:20.633463+00:00",{"id":126,"slug":127,"title":128,"created_at":129},"6954fa2b-8b66-4839-884b-e46f89fa1bc3","adaptive-block-scaled-data-types-en","IF4: Smarter 4-Bit Quantization That Adapts to Your Data","2026-03-31T06:00:36.65963+00:00"]