OpenAI o3
OpenAI previewed o3 in December 2024 as a step-change in reasoning, headlined by a record ARC-AGI score (76% at low compute, ~88% at high compute) that it framed as a leap toward more general intelligence, plus strong results on GPQA Diamond (87.7%), SWE-bench Verified, and Codeforces. OpenAI positioned o3 as the successor to o1 and, at the April 2025 launch alongside o4-mini, as its 'most advanced reasoning model': the first that could agentically use every ChatGPT tool (web search, Python, file and image analysis, image generation). It also said that, in evaluations by external experts, o3 made roughly 20% fewer major errors than o1 on hard real-world tasks.
The full o3 shipped on April 16, 2025, to ChatGPT Plus, Pro, and Team users and via the Chat Completions and Responses APIs, with o3-pro following on June 10, 2025; o3-mini had already reached free and paid tiers on January 31, 2025.
o3 is the rare case where the marketing moment and the shipped product visibly diverged. OpenAI and the ARC Prize Foundation later confirmed that the released o3 was a different model from the December demo, running at smaller compute tiers and not tuned on ARC-AGI data. Its ARC-AGI-1 score fell to roughly 41-53%, and cost-per-task estimates for the demo were revised sharply upward. Strip away the AGI framing and what actually shipped was still a genuinely strong, tool-using reasoning model that became a daily workhorse for coding, analysis, and research. The lesson for executives is that a headline benchmark from a staged preview does not describe the product you can buy. Judge the released model on your own tasks, not by the launch-day chart.
o3 turned 'reasoning models' from a research demo into a practical default for hard knowledge work, while also showing why a flashy preview benchmark deserves skepticism until the shipping version is independently measured.