Rootly benchmark: Llama 4 trails coding models
Rootly AI Labs says Llama 4 lagged coding-focused models on a Mastodon GitHub benchmark, with GPT-4o and Qwen2.5-Coder ahead.

Rootly says its AI Labs benchmark found Llama 4 underperformed on coding tasks, even versus its older sibling, Llama 3.3. The test, published April 11, 2025, used 100 Mastodon GitHub bug issues and asked models to pick the correct pull request from four choices.
| Item | Value |
|---|---|
| Benchmark size | 100 GitHub bug issues |
| Llama 4 Maverick accuracy | 70% |
| Llama 4 overall accuracy | 69.5% |
| DeepSeek v3.1 gap | 6 percentage points ahead of Llama 4 |
| GPT-4o gap | 18 percentage points ahead of Llama 4 |
| Qwen2.5-Coder-32B accuracy | About 90% |

What changed
Rootly AI Labs compared Llama 4 Scout, Maverick, and Behemoth against both general multimodal models and coding-tuned systems. The team says it could not reproduce Meta’s claim that Llama 4 beats GPT-4o, Gemini 2.0 Flash, and DeepSeek v3.1 on reasoning and coding.

The benchmark setup was simple: each model saw a bug report plus four candidate PRs, with one correct match. No codebase context was included. Rootly says that made the task closer to a real triage workflow than a broad academic benchmark.
- Llama 4 came last in Rootly’s accuracy ranking at 69.5%.
- Llama 3.3 70B-Versatile scored 72%, edging out Llama 4.
- DeepSeek v3.1 beat Llama 4 by 6 percentage points.
- GPT-4o led Llama 4 by 18 percentage points.
- Qwen2.5-Coder-32B and OpenAI o3-mini landed near 90%.
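
The scoring behind these figures can be sketched in a few lines. This is a minimal illustration of four-way multiple-choice accuracy, not Rootly's actual harness: `TriageItem`, `accuracy`, and `gap_in_points` are hypothetical names, and the `choose` callable stands in for whatever model call a team would plug in.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class TriageItem:
    """One benchmark question (hypothetical structure, assumed for illustration)."""
    issue: str                  # text of the bug report
    candidates: Sequence[str]   # candidate PR descriptions, four per issue in the article
    correct_index: int          # position of the PR that matches the issue


# A chooser takes the issue text and the candidates and returns the picked index.
Chooser = Callable[[str, Sequence[str]], int]


def accuracy(items: Sequence[TriageItem], choose: Chooser) -> float:
    """Return the fraction of items where the chooser picks the correct PR."""
    if not items:
        raise ValueError("need at least one item")
    hits = sum(
        1 for item in items
        if choose(item.issue, item.candidates) == item.correct_index
    )
    return hits / len(items)


def gap_in_points(score_a: float, score_b: float) -> float:
    """Difference between two accuracies, expressed in percentage points."""
    return round((score_a - score_b) * 100, 1)
```

With the scores reported above, `gap_in_points(0.72, 0.695)` gives 2.5, the margin by which Llama 3.3 edged out Llama 4. Reporting gaps in percentage points rather than as relative percentages keeps them directly comparable across models.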

Why it matters
For developers, the result is a reminder that benchmark headlines can hide task-specific gaps. A model that looks strong on general tests may still miss the mark on code triage, bug fixing, or incident response workflows.

For teams choosing an LLM, the practical takeaway is narrower: if the job is coding help, Rootly’s data points to specialized models such as Qwen2.5-Coder-32B or o3-mini rather than a general-purpose release like Llama 4.

Rootly says the dataset is open source and the test set is small, so the numbers are not the final word on model quality. The sharper question is whether Llama 4’s architecture helps in broad chat tasks more than in the coding work developers actually need.
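
To put the small test set in perspective, here is a rough sketch of the sampling uncertainty around a score measured on 100 questions. It uses the standard Wilson score interval and assumes each issue is an independent trial; Rootly did not report such intervals, and `wilson_interval` is a name chosen for this illustration.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must be between 0 and n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Assumption: 70% on 100 issues is treated as 70 correct answers out of 100.
low, high = wilson_interval(70, 100)
print(f"70/100 correct: {low:.1%} to {high:.1%}")
```

Under these assumptions, a 70% score on 100 issues is compatible with a true accuracy anywhere from roughly 60% to 78%. A gap of two or three points, such as the one between Llama 3.3 and Llama 4, sits well inside that range, while an 18-point gap does not.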