[RSCH] 3 min read · OraCore Editors

Rootly benchmark: Llama 4 trails coding models

Rootly AI Labs says Llama 4 lagged coding-focused models on a Mastodon GitHub benchmark, with GPT-4o and Qwen2.5-Coder ahead.

Rootly says its AI Labs benchmark found Llama 4 underperformed on coding tasks, even versus its older sibling, Llama 3.3. The test, published April 11, 2025, used 100 Mastodon GitHub bug issues and asked models to pick the correct pull request from four choices.

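| Item | Value |
| --- | --- |
| Benchmark size | 100 GitHub bug issues |
| Llama 4 Maverick accuracy | 70% |
| Llama 4 overall accuracy | 69.5% |
| DeepSeek v3.1 gap | 6 percentage points ahead of Llama 4 |
| GPT-4o gap | 18 percentage points ahead of Llama 4 |
| Qwen2.5-Coder-32B accuracy | About 90% |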

What changed

Rootly AI Labs compared Llama 4 Scout, Maverick, and Behemoth against both general multimodal models and coding-tuned systems. The team says it could not reproduce Meta’s claim that Llama 4 beats GPT-4o, Gemini 2.0 Flash, and DeepSeek v3.1 on reasoning and coding.

The benchmark setup was simple: each model saw a bug report plus four candidate PRs, with one correct match. No codebase context was included. Rootly says that made the task closer to a real triage workflow than a broad academic benchmark.
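The setup described above amounts to four-way multiple choice, so random guessing would average 25%. A minimal sketch of how such a run could be scored follows. It is not Rootly's harness: the `TriageItem` schema, the `accuracy` helper, the `always_first` stand-in for a model call, and the toy data are all hypothetical, invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class TriageItem:
    """One benchmark item (hypothetical schema, not Rootly's actual format)."""
    issue_text: str                 # the bug report shown to the model
    candidate_prs: Sequence[str]    # four candidate PRs, per the article
    correct_index: int              # position of the PR that fixed the bug


# A "model" is any callable that returns the index of the PR it picks.
Chooser = Callable[[str, Sequence[str]], int]


def accuracy(items: Sequence[TriageItem], choose: Chooser) -> float:
    """Fraction of items where the chooser picked the correct PR."""
    if not items:
        raise ValueError("need at least one item")
    hits = sum(
        1 for item in items
        if choose(item.issue_text, item.candidate_prs) == item.correct_index
    )
    return hits / len(items)


def always_first(issue_text: str, candidate_prs: Sequence[str]) -> int:
    """Trivial stand-in for a real model call: always pick option 0."""
    return 0


# Toy data invented for illustration; none of it comes from the dataset.
toy_items = [
    TriageItem("Crash on empty timeline", ["PR A", "PR B", "PR C", "PR D"], 0),
    TriageItem("Wrong avatar size", ["PR E", "PR F", "PR G", "PR H"], 2),
    TriageItem("Login loop", ["PR I", "PR J", "PR K", "PR L"], 0),
    TriageItem("Emoji picker freezes", ["PR M", "PR N", "PR O", "PR P"], 3),
]

toy_accuracy = accuracy(toy_items, always_first)
print(toy_accuracy)  # 0.5 (two of the four toy answers sit at index 0)
```

In a real run, `always_first` would be replaced by a function that sends the issue and the four candidates to a model and parses the chosen option from its reply.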

  • Llama 4 came last in Rootly’s accuracy ranking at 69.5%.
  • Llama 3.3 70B-Versatile scored 72%, edging out Llama 4.
  • DeepSeek v3.1 beat Llama 4 by 6 percentage points.
  • GPT-4o led Llama 4 by 18 percentage points.
  • Qwen2.5-Coder-32B and OpenAI o3-mini landed near 90%.
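The gaps in this list are in percentage points, which is not the same as a relative percentage. A short worked sketch makes the difference concrete. The implied DeepSeek v3.1 and GPT-4o scores below are derived from the figures above by simple addition, on the assumption that the gaps are measured against the 69.5% overall figure; the article does not report them directly.

```python
# Figures taken from the article. The derived scores are implied, not reported.
LLAMA4_OVERALL = 69.5  # percent accuracy
GAPS_PP = {"DeepSeek v3.1": 6.0, "GPT-4o": 18.0}  # leads in percentage points


def implied_score(base: float, gap_pp: float) -> float:
    """Absolute accuracy implied by a percentage-point lead over the base."""
    return base + gap_pp


def relative_gain(base: float, gap_pp: float) -> float:
    """The same lead expressed as a percentage of the base score."""
    return round(gap_pp / base * 100, 1)


implied = {name: implied_score(LLAMA4_OVERALL, gap) for name, gap in GAPS_PP.items()}
relative = {name: relative_gain(LLAMA4_OVERALL, gap) for name, gap in GAPS_PP.items()}

for name in GAPS_PP:
    print(f"{name}: implied {implied[name]}%, relative gain {relative[name]}%")
# DeepSeek v3.1: implied 75.5%, relative gain 8.6%
# GPT-4o: implied 87.5%, relative gain 25.9%
```

So an 18-point lead over a 69.5% baseline is roughly a 26% relative improvement, which is why the unit matters when comparing headlines.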

Why it matters

For developers, the result is a reminder that benchmark headlines can hide task-specific gaps. A model that looks strong on general tests may still miss the mark on code triage, bug fixing, or incident response workflows.

For teams choosing an LLM, the practical takeaway is narrower: if the job is coding help, Rootly’s data points to specialized models such as Qwen2.5-Coder-32B or o3-mini rather than a general-purpose release like Llama 4.

Rootly says the dataset is open source, and the test set is small, so the numbers are not the final word on model quality. The sharper question is whether Llama 4’s architecture helps in broad chat tasks more than in the coding work developers actually need.
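To put a rough number on how small 100 items is, the sketch below computes a 95% Wilson score interval for two of the reported scores. It assumes each score is a simple proportion of 100 independent items (so 70% means 70 correct), which the article does not confirm; the function name `wilson_interval` is ours.

```python
import math
from typing import Tuple


def wilson_interval(correct: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for about 95%)."""
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("need 0 <= correct <= total and total > 0")
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


# Assumed counts: 70/100 for Llama 4 Maverick, 72/100 for Llama 3.3 70B.
maverick = wilson_interval(70, 100)
llama33 = wilson_interval(72, 100)

# Two intervals overlap when each one starts before the other ends.
intervals_overlap = maverick[1] >= llama33[0] and llama33[1] >= maverick[0]

print(f"Maverick:  {maverick[0]:.3f} to {maverick[1]:.3f}")
print(f"Llama 3.3: {llama33[0]:.3f} to {llama33[1]:.3f}")
print("overlap:", intervals_overlap)
```

Under these assumptions each interval spans roughly 17 to 18 percentage points, so the two-point gap between Llama 3.3 and Llama 4 Maverick sits well inside the noise, while the gap to the models near 90% is far larger.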