GPT-5
The launch was technically credible but operationally rocky, and it became a case study in change management. On day one the automatic router malfunctioned for part of the day, making GPT-5 feel — in Altman's own word…
We track each major AI model from announcement to release, then publish a reality verdict after the dust settles, separating the launch claims from what the model is observed to do, in plain language for executives.
The capability claims largely held up: Gemini 2.5 Pro genuinely led LMArena at launch and was widely regarded as a top-tier reasoning and long-context model through most of 2025, and its 1M-token context was a real, u…
The "best coding model" claim held up on the benchmark Anthropic chose to lead with, but it was a coding-specific crown, not across-the-board supremacy: contemporaneous coverage noted Opus 4 trailed OpenAI's o3 on som…
The core technical achievement was real and largely survived scrutiny: independent analysts agreed R1 reached roughly o1-class reasoning, and a Chinese lab catching the frontier in months was genuinely significant. Th…
o1 proved that test-time compute is a genuine new scaling lever and kicked off the "reasoning model" era that competitors and OpenAI itself rapidly built on (o3 was previewed within months). The benchmark gains on mat…
The launch delivered on its headline claim (a genuinely frontier-scale model anyone could download) and reset the ceiling for what "open" AI could do. In practice, the 405B model itself saw limited direct deployment…
This launch is widely regarded, in hindsight, as the moment Claude became the default model for serious software development. The coding and agentic-quality leap was real and held up under independent use, driving rap…
GPT-4o's lasting impact was less about a leap in raw intelligence (it was roughly GPT-4-Turbo-class on text) and more about cost, speed, and reach: it became the default ChatGPT model for hundreds of millions of users…
Gemini 1.5 Pro reframed the competitive axis from raw benchmark scores to usable context length, and the 1M-token (later 2M) window proved genuinely useful for document-heavy, video, and large-codebase workloads that …
Llama 2 was the launch that made open-weight models a serious enterprise option, seeding a vast ecosystem of fine-tunes, hosted endpoints, and on-prem deployments that gave organizations a path off the API-only fronti…
GPT-4 was a genuine capability step: more reliable, better at reasoning, and far more useful for real professional work than GPT-3.5. It became the default "serious" model underpinning a huge wave of enterprise de…
This was the launch that mattered, and its impact came almost entirely from packaging rather than from a new frontier model. The underlying capability had largely existed in GPT-3.5 for months; wrapping it in a free, …
GPT-3's real significance was strategic, not consumer: it proved the scaling thesis that defined the next half-decade of AI and seeded the first wave of startups built on top of a hosted model API, while OpenAI's pivo…