Model Watch · Reviewed

Meta Llama 4

Announced Apr 5, 2025 · Released Apr 5, 2025 · Reviewed Jun 23, 2026

What they claimed

Meta positioned Llama 4 as the start of a 'new era of natively multimodal AI' and its first family built on a Mixture-of-Experts (MoE) architecture. Two models shipped at launch: Scout (17B active parameters, 16 experts, a claimed 10M-token context window) and Maverick (17B active, 128 experts), both pitched as multimodal and efficient. Meta claimed Maverick beat GPT-4o and Gemini 2.0 Flash across many benchmarks and matched DeepSeek v3 on reasoning and coding at less than half the active parameters. A third, still-training teacher model, Behemoth (288B active, ~2T total), was previewed as outperforming GPT-4.5, Claude Sonnet 3.7, and Gemini 2.0 Pro on STEM tests.

What shipped

Scout and Maverick shipped the same day as open-weight downloads on llama.com and Hugging Face under the Llama 4 Community License. The license allows commercial use, but firms with more than 700M monthly active users must request a separate license from Meta, and EU-domiciled entities are excluded from the license grant for the multimodal models. Behemoth was not released and remained in training.

The verdict

The launch became a textbook claims-vs-reality case. Meta touted a number-two LMArena finish, but the version it submitted, 'Llama-4-Maverick-03-26-Experimental,' was a chat-tuned variant optimized for human-preference voting and was not the weights anyone could download. When the actual released Maverick was tested, it landed around 32nd on the same leaderboard, below older rivals like GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro. LMArena publicly rebuked Meta, saying its interpretation of the rules 'did not match what we expect,' apologized for unclear labeling, and tightened its policy against benchmark-specific tuning. The episode is a clean illustration of how a leaderboard headline can describe a model the public never receives.

Why it matters

It shows why executives should treat a single leaderboard ranking as a marketing artifact, not a procurement signal: the model that tops the chart may differ from the one your teams can actually deploy.

