OpenAI GPT-4o
At its Spring Update on May 13, 2024, OpenAI announced GPT-4o ("o" for omni) as a single model trained end-to-end across text, vision, and audio, accepting any combination of those inputs and producing text, audio, and image outputs. OpenAI claimed near-human conversational latency, with responses to audio input in as little as 232 milliseconds (320 milliseconds on average). It also claimed parity with GPT-4 Turbo on English text and code, better performance on non-English languages, a new high of 88.7 on MMLU (a text-based general-knowledge benchmark), and new marks on vision and audio benchmarks. In the API, GPT-4o was pitched as 2x faster and 50% cheaper than GPT-4 Turbo.
Text and vision in GPT-4o began rolling out immediately, including to ChatGPT free-tier users (with higher limits for Plus/Team) and in the API at $5/$15 per million input/output tokens. The signature real-time speech-to-speech voice mode did not ship at launch. The API initially lacked audio, and Advanced Voice Mode reached a small group of Plus users as an alpha in late July 2024 before rolling out to Plus/Team users broadly in late September 2024.
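To make the pricing claim concrete, here is a minimal sketch of the cost arithmetic. The GPT-4o rates are the $5/$15 figures quoted above. The GPT-4 Turbo rates are an assumption back-derived from the "50% cheaper" claim rather than taken from a price sheet, and the monthly token volumes are a hypothetical workload chosen only for illustration.

```python
# Launch list prices in USD per one million tokens.
# GPT-4o rates come from the text above. The GPT-4 Turbo rates are an
# assumption inferred from the "50% cheaper" claim (2x the GPT-4o rates).
GPT_4O = {"input": 5.00, "output": 15.00}
GPT_4_TURBO = {"input": 10.00, "output": 30.00}  # assumed, not quoted


def monthly_cost(prices, input_tokens, output_tokens):
    """Return the USD cost of a token volume at per-million-token rates."""
    total = input_tokens * prices["input"] + output_tokens * prices["output"]
    return total / 1_000_000


# Hypothetical workload: 200M input tokens and 50M output tokens per month.
INPUT_TOKENS = 200_000_000
OUTPUT_TOKENS = 50_000_000

cost_4o = monthly_cost(GPT_4O, INPUT_TOKENS, OUTPUT_TOKENS)
cost_turbo = monthly_cost(GPT_4_TURBO, INPUT_TOKENS, OUTPUT_TOKENS)
savings = 1 - cost_4o / cost_turbo

print(f"GPT-4o:                ${cost_4o:,.2f}")
print(f"GPT-4 Turbo (assumed): ${cost_turbo:,.2f}")
print(f"Savings:               {savings:.0%}")
```

Because both the input and output rates are halved under this assumption, the saving is 50% whatever the input/output mix. Only the absolute dollar figure depends on the workload.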
GPT-4o's lasting impact was less about a leap in raw intelligence (it was roughly GPT-4-Turbo-class on text) and more about cost, speed, and reach. It became the default ChatGPT model for hundreds of millions of users and put a GPT-4-class model in the free tier, which mattered more commercially than benchmark deltas. The launch demo of fluid, interruptible voice generated enormous hype, but that experience shipped months later in a more constrained form. That makes the launch a case study in the gap between a polished event and what is actually in users' hands on day one. The 50% API price cut was the more immediately real story, accelerating the broader collapse in token costs that reshaped build-vs-buy math. Strategically, GPT-4o cemented natively multimodal, low-latency interaction as the expected baseline and set up the later "o-series" reasoning models as OpenAI's next differentiator.
GPT-4o made frontier-grade AI cheap for developers and free for consumers at massive scale, but the lag between its dazzling voice demo and actual availability is a reminder to budget and plan around what has shipped, not what was demoed.