Model Watch · 13 tracked

Every major model launch. What it actually means.

We track each major AI model from announcement to release, then publish a reality verdict after the dust settles — separating the launch claims from what the model is observed to do, in plain language for executives.

Reviewed · OpenAI · Aug 7, 2025

GPT-5

The launch was technically credible but operationally rocky, and it became a case study in change management. On day one the automatic router malfunctioned for part of the day, making GPT-5 feel — in Altman's own word…


Reviewed · Google · Jun 17, 2025

Gemini 2.5 Pro

The capability claims largely held up: Gemini 2.5 Pro genuinely led LMArena at launch and was widely regarded as a top-tier reasoning and long-context model through most of 2025, and its 1M-token context was a real, u…


Reviewed · Anthropic · May 22, 2025

Claude Opus 4

The "best coding model" claim held up on the benchmark Anthropic chose to lead with, but it was a coding-specific crown, not across-the-board supremacy: contemporaneous coverage noted Opus 4 trailed OpenAI's o3 on som…


Reviewed · DeepSeek · Jan 20, 2025

R1

The core technical achievement was real and largely survived scrutiny: independent analysts agreed R1 reached roughly o1-class reasoning, and a Chinese lab catching the frontier in months was genuinely significant. Th…


Reviewed · OpenAI · Dec 5, 2024

o1

o1 proved that test-time compute is a genuine new scaling lever and kicked off the "reasoning model" era that competitors and OpenAI itself rapidly built on (o3 was previewed within months). The benchmark gains on mat…


Reviewed · Meta · Jul 23, 2024

Llama 3.1 405B

The launch delivered on its headline claim — a genuinely frontier-scale model anyone could download — and reset the ceiling for what "open" AI could do. In practice, the 405B model itself saw limited direct deployment…


Reviewed · Anthropic · Jun 20, 2024

Claude 3.5 Sonnet

This launch is widely regarded, in hindsight, as the moment Claude became the default model for serious software development. The coding and agentic-quality leap was real and held up under independent use, driving rap…


Reviewed · OpenAI · May 13, 2024

GPT-4o

GPT-4o's lasting impact was less about a leap in raw intelligence (it was roughly GPT-4-Turbo-class on text) and more about cost, speed, and reach: it became the default ChatGPT model for hundreds of millions of users…


Reviewed · Google · Feb 15, 2024

Gemini 1.5 Pro

Gemini 1.5 Pro reframed the competitive axis from raw benchmark scores to usable context length, and the 1M-token (later 2M) window proved genuinely useful for document-heavy, video, and large-codebase workloads that …


Reviewed · Meta · Jul 18, 2023

Llama 2

Llama 2 was the launch that made open-weight models a serious enterprise option, seeding a vast ecosystem of fine-tunes, hosted endpoints, and on-prem deployments that gave organizations a path off the API-only fronti…


Reviewed · OpenAI · Mar 14, 2023

GPT-4

GPT-4 was a genuine capability step: more reliable, better at reasoning, and far more useful for real professional work than GPT-3.5. It became the default "serious" model underpinning a huge wave of enterprise de…


Reviewed · OpenAI · Nov 30, 2022

ChatGPT (GPT-3.5)

This was the launch that mattered, and its impact came almost entirely from packaging rather than from a new frontier model. The underlying capability had largely existed in GPT-3.5 for months; wrapping it in a free, …


Reviewed · OpenAI · Jun 11, 2020

GPT-3

GPT-3's real significance was strategic rather than consumer-facing: it proved the scaling thesis that defined the next half-decade of AI and seeded the first wave of startups built on top of a hosted model API, while OpenAI's pivo…
