OpenAI o1
OpenAI introduced o1 as a new class of model that "thinks before it answers," spending more inference-time compute on an internal chain of thought to tackle hard reasoning problems in science, coding, and math. The company claimed PhD-level performance on physics, chemistry, and biology benchmarks, a headline jump on competition math (roughly 83% on AIME 2024 versus about 13% for GPT-4o), and an 89th-percentile Codeforces ranking. The launch framed o1 as the first in a new series of reasoning models, a different scaling axis from simply making models bigger, and paired it with a smaller, cheaper o1-mini aimed at STEM and coding.
On September 12, 2024, the preview shipped as o1-preview and o1-mini, selectable in ChatGPT for Plus and Team users and available via the API to higher-tier developers, with o1-preview priced several times above GPT-4o. The full o1 model became generally available on December 5, 2024, adding image input and faster, more accurate responses, alongside a new $200/month ChatGPT Pro tier offering an o1 "pro mode."
o1 showed that test-time compute is a genuine new scaling lever and kicked off the "reasoning model" era that competitors and OpenAI itself rapidly built on (o3 was previewed within months). The benchmark gains on math and code were real and impressive, but the practical picture was more nuanced: o1 was markedly slower and more expensive, offered no clear advantage on many everyday tasks, and hid its reasoning steps while still billing for those hidden "reasoning tokens," an unusual and somewhat opaque cost model. The PhD-level framing oversold day-to-day utility; the real significance was architectural and directional, signaling that the next gains would come from how models compute at inference, not just from training larger models. For most organizations the immediate value was narrow (hard technical and analytical problems), with the broader payoff arriving through the cheaper, faster reasoning models it spawned.
o1 signaled that AI progress is shifting from bigger pre-training to spending more compute at the moment of answering, which changes both the cost structure and where these models add the most value. For executives, it reframes "which model" decisions around matching task difficulty to compute spend rather than assuming one model fits every use case.
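To make the cost-structure point concrete, here is a minimal sketch of how hidden reasoning tokens can inflate a bill. It assumes reasoning tokens are charged at the same rate as visible output tokens. The `Pricing` class, the `request_cost` function, the per-million-token prices, and the token counts are all hypothetical, illustrative values, not official figures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Hypothetical per-million-token prices in USD (not official figures)."""
    input_per_million: float
    output_per_million: float


def request_cost(pricing, input_tokens, visible_output_tokens, reasoning_tokens=0):
    """Estimate the cost of one request in USD.

    Assumption: hidden reasoning tokens are billed at the output-token rate,
    so they are added to the visible output before pricing.
    """
    billed_output_tokens = visible_output_tokens + reasoning_tokens
    total = (
        input_tokens * pricing.input_per_million
        + billed_output_tokens * pricing.output_per_million
    )
    return total / 1_000_000


# Illustrative prices only; check a provider's current price list before relying on them.
standard_model = Pricing(input_per_million=5.0, output_per_million=15.0)
reasoning_model = Pricing(input_per_million=15.0, output_per_million=60.0)

# Same prompt and same visible answer length; the reasoning model also
# spends a hypothetical 4,000 hidden reasoning tokens.
standard_cost = request_cost(standard_model, 1_000, 500)
reasoning_cost = request_cost(reasoning_model, 1_000, 500, reasoning_tokens=4_000)

print(f"standard:  ${standard_cost:.4f}")
print(f"reasoning: ${reasoning_cost:.4f}")
print(f"ratio:     {reasoning_cost / standard_cost:.1f}x")
```

Under these assumed numbers, the gap in per-request cost is much wider than the gap in list prices, because most of the billed output is reasoning the user never sees. That is why matching task difficulty to compute spend matters more than the price sheet alone suggests.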