Rootly benchmark: Llama 4 trails coding models

OraCore Editors

[RSCH] June 22, 2026 | 3 min read

Rootly AI Labs says Llama 4 lagged coding-focused models on a Mastodon GitHub benchmark, with GPT-4o and Qwen2.5-Coder ahead.

Rootly says its AI Labs benchmark found Llama 4 underperformed on coding tasks, even versus its older sibling, Llama 3.3. The test, published April 11, 2025, used 100 Mastodon GitHub bug issues and asked models to pick the correct pull request from four choices.

Metric	Value
Benchmark size	100 GitHub bug issues
Llama 4 Maverick accuracy	70%
Llama 4 overall accuracy	69.5%
DeepSeek v3.1 gap	6 percentage points ahead of Llama 4
GPT-4o gap	18 percentage points ahead of Llama 4
Qwen2.5-Coder-32B accuracy	About 90%

What changed

Rootly AI Labs compared Llama 4 Scout, Maverick, and Behemoth against both general multimodal models and coding-tuned systems. The team says it could not reproduce Meta’s claim that Llama 4 beats GPT-4o, Gemini 2.0 Flash, and DeepSeek v3.1 on reasoning and coding.

The benchmark setup was simple: each model saw a bug report plus four candidate PRs, with one correct match. No codebase context was included. Rootly says that made the task closer to a real triage workflow than a broad academic benchmark.
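As a rough illustration of that setup, the multiple-choice scoring can be sketched in a few lines of Python. This is a minimal sketch, not Rootly's actual harness: the names TriageItem, accuracy, and always_first are hypothetical, the demo items are made-up placeholders rather than real Mastodon issues, and the choose callback stands in for whatever model call a team would plug in.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class TriageItem:
    """One benchmark question: a bug report plus four candidate PRs (hypothetical schema)."""
    issue_text: str
    candidate_prs: Sequence[str]  # four PR descriptions, no codebase context
    correct_index: int            # position of the PR that actually fixed the bug


def accuracy(items: Sequence[TriageItem],
             choose: Callable[[str, Sequence[str]], int]) -> float:
    """Return the fraction of items where the chosen PR is the correct one."""
    if not items:
        raise ValueError("need at least one item")
    hits = sum(
        1 for item in items
        if choose(item.issue_text, item.candidate_prs) == item.correct_index
    )
    return hits / len(items)


def always_first(issue_text: str, candidate_prs: Sequence[str]) -> int:
    """Placeholder for a model call: always picks the first candidate."""
    return 0


# Made-up placeholder data, only to exercise the scoring function.
demo_items = [
    TriageItem("Timeline fails to load", ["PR A", "PR B", "PR C", "PR D"], 0),
    TriageItem("Avatar upload crashes", ["PR E", "PR F", "PR G", "PR H"], 2),
    TriageItem("Notification count is stale", ["PR I", "PR J", "PR K", "PR L"], 0),
    TriageItem("Search ignores hashtags", ["PR M", "PR N", "PR O", "PR P"], 3),
]

print(f"baseline accuracy: {accuracy(demo_items, always_first):.0%}")
```

With four choices per question, random guessing would be expected to land around 25%, which is the floor any reported score should be read against.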

Llama 4 came last in Rootly’s accuracy ranking at 69.5%.
Llama 3.3 70B-Versatile scored 72%, edging out Llama 4.
DeepSeek v3.1 beat Llama 4 by 6 percentage points.
GPT-4o led Llama 4 by 18 percentage points.
Qwen2.5-Coder-32B and OpenAI o3-mini landed near 90%.
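The gaps above are stated in percentage points, which is not the same as a relative improvement. The short Python sketch below works through the arithmetic, assuming the gaps are measured from Llama 4's 69.5% overall figure. The implied scores are derived here for illustration and may not match the exact numbers in Rootly's own report; the function names are hypothetical.

```python
# Figures as reported in the article, in percent.
LLAMA_4_OVERALL = 69.5
GAPS_IN_POINTS = {"DeepSeek v3.1": 6.0, "GPT-4o": 18.0}


def implied_score(baseline: float, gap_points: float) -> float:
    """Score implied by adding a percentage-point gap to the baseline."""
    return baseline + gap_points


def relative_gain(baseline: float, gap_points: float) -> float:
    """The same gap expressed as a percent of the baseline score."""
    return 100.0 * gap_points / baseline


# Assumption: each gap is measured against the 69.5% overall figure.
implied = {
    name: implied_score(LLAMA_4_OVERALL, gap)
    for name, gap in GAPS_IN_POINTS.items()
}

for name, gap in GAPS_IN_POINTS.items():
    print(
        f"{name}: about {implied[name]:.1f}% accuracy, "
        f"{gap:.0f} points ahead, "
        f"{relative_gain(LLAMA_4_OVERALL, gap):.1f}% relative gain"
    )
```

Under that assumption, an 18-point lead corresponds to roughly a 26% relative gain over Llama 4, which is why the two ways of quoting a gap should not be mixed.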

Why it matters

For developers, the result is a reminder that benchmark headlines can hide task-specific gaps. A model that looks strong on general tests may still miss the mark on code triage, bug fixing, or incident response workflows.

For teams choosing an LLM, the practical takeaway is narrower: if the job is coding help, Rootly’s data points to specialized models such as Qwen2.5-Coder-32B or o3-mini rather than a general-purpose release like Llama 4.

Rootly says the dataset is open source and the test set is small, so the numbers are not the final word on model quality. The sharper question is whether Llama 4’s architecture helps in broad chat tasks more than in the coding work developers actually need.
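To put the small test set in perspective, a back-of-the-envelope margin of error can be computed in Python. This is a sketch under stated assumptions, not part of Rootly's analysis: it treats a score as a simple proportion over 100 independent questions and uses the normal approximation to the binomial, and margin_of_error is a hypothetical helper name.

```python
import math


def margin_of_error(accuracy: float, n_items: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error for an accuracy measured on n_items questions.

    Uses the normal approximation to the binomial; accuracy is a fraction in [0, 1].
    """
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    if n_items <= 0:
        raise ValueError("n_items must be positive")
    return z * math.sqrt(accuracy * (1.0 - accuracy) / n_items)


# Assumption: Llama 4's 69.5% is treated as a proportion over 100 items.
half_width = margin_of_error(0.695, 100)
print(f"69.5% on 100 items: roughly +/- {half_width * 100:.1f} points")
```

Under these assumptions the margin is about nine percentage points, so the 2.5-point difference between Llama 3.3 and Llama 4 could plausibly be noise, while the 18-point gap to GPT-4o is much harder to explain that way.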
