Hit
Gartner (Rita Sallam)
“At least 30% of generative AI projects will be abandoned after proof of concept by the end of 2025.”
July 2024 · Gartner press release
Resolution criteria
At least 30% of enterprise GenAI proofs-of-concept abandoned, per industry surveys.
What actually happened
Met, and likely understated: by the end of 2025, multiple analyses put the share of proofs of concept abandoned in the 30–50% range.
Postmortem
A rare bearish forecast from a firm that usually sells optimism — and it landed. The lesson for leaders: a working pilot is not a working product; budget for the gap between the two.
Source: Gartner

Miss
Dario Amodei (Anthropic CEO)
“AI will be writing 90% of the code in three to six months.”
March 2025 · Council on Foreign Relations
Resolution criteria
AI authoring roughly 90% of code across the industry / most teams.
What actually happened
Did not happen industry-wide by September 2025. Amodei later reframed it as true "on many teams" inside Anthropic — not the broad claim as originally stated.
Postmortem
The capability is advancing fast, but the headline number and timeline were a leader talking his own book. Watch the pattern: a bold public number, then a quiet internal reframing.
Source: Yahoo Finance

Partial
Sam Altman (OpenAI CEO)
“In 2025 we may see the first AI agents join the workforce and materially change the output of companies.”
January 2025 · "Reflections" blog post
Resolution criteria
AI agents in real production use, materially changing company output.
What actually happened
Agent products shipped (Operator, Copilot, Claude tools) and entered pilots, but "materially change the output of companies" stayed aspirational — most deployments remained experimental.
Postmortem
The tools arrived on schedule; the impact did not. The gap between "agents exist" and "agents move the P&L" is exactly what enterprise pilot data later exposed.
Source: Axios

Miss
Elon Musk
“We will have AI that is smarter than any one human probably around the end of 2025.”
April 2024 · interview with Norges Bank’s Nicolai Tangen
Resolution criteria
A single AI system broadly smarter than the smartest human.
What actually happened
No such system by end of 2025. Critics publicly offered a $1M (raised to $10M) bet against it; the offer went untaken.
Postmortem
Models hit superhuman marks on narrow tests (see the Math Olympiad entry) while staying far from "smarter than any human" in general. Be wary of single-number AGI timelines — especially from people selling a model.
Source: Fortune

Miss
Elon Musk (Tesla)
“Unsupervised Full Self-Driving and a robotaxi network across 8–10 metros, reaching half the US population, by the end of 2025.”
2024–2025 · earnings calls & launch events
Resolution criteria
Driverless robotaxis at the promised scale; unsupervised FSD for owners.
What actually happened
A robotaxi pilot launched in Austin in mid-2025 with human safety monitors on board. The promised multi-metro scale and unsupervised FSD did not arrive; the goal slipped to "widespread by end of 2026."
Postmortem
A near-annual pattern: a concrete, dated autonomy promise that slips by roughly a year. Useful for calibrating any "next year" self-driving claim.
Source: InsideEVs

Partial
Cognition
“Devin, "the first AI software engineer" — a new state-of-the-art that passed real engineering interviews and completed paid Upwork jobs.”
March 2024 · launch announcement
Resolution criteria
Autonomous, end-to-end software engineering at the implied human level.
What actually happened
The headline benchmark (13.86% on SWE-bench) ran under favorable conditions; independent real-world completion landed near that figure, and stronger models surpassed it within a year. Impressive, but well short of "an AI software engineer."
Postmortem
A masterclass in benchmark framing. When a demo claims to replace a human job, ask for the test conditions before you trust the headline number.
Source: Independent review

Partial
Klarna (CEO Sebastian Siemiatkowski)
“Klarna’s OpenAI-powered assistant does the work of ~700 customer-service agents — AI can run the function.”
2024 · earnings & press
Resolution criteria
AI sustainably replacing the human customer-service function.
What actually happened
Mixed. Klarna walked back the all-in approach in 2025 — "we went too far" — and rehired humans for quality, while still reporting the AI doing ~853-agent-equivalent work and ~$60M in savings.
Postmortem
Both stories are true: real efficiency, and a quality/trust ceiling that forced humans back in. The honest read is "AI plus humans," not "AI instead of humans."
Source: Entrepreneur

Partial
Mark Zuckerberg (Meta CEO)
“In 2025, Meta and others will have an AI that can effectively be a mid-level engineer that writes code.”
January 2025 · The Joe Rogan Experience
Resolution criteria
AI performing at a mid-level engineer’s level and replacing those roles.
What actually happened
Partly. AI coding tools became genuinely strong and reshaped hiring (fewer junior/mid backfills), but wholesale replacement of mid-level engineers did not occur — the role shifted toward AI-augmented engineers.
Postmortem
The capability narrowed the gap; the org change was "hire differently," not "replace." The durable signal is a changed hiring mix, not a headcount wipeout.
Source: Entrepreneur

Hit
OpenAI & Google DeepMind
“Reasoning models will reach elite, gold-medal-level performance on competition mathematics.”
July 2025 · International Mathematical Olympiad
Resolution criteria
Gold-medal-level IMO performance by a general-purpose model.
What actually happened
Achieved: both reached gold-medal-level performance (35/42 points, solving 5 of 6 problems) with general-purpose reasoning models; DeepMind’s entry was officially graded.
Postmortem
A genuine, verifiable leap, and a counterweight to the misses: narrow, well-defined reasoning is advancing faster than the broad "AGI" and jobs claims. The caution is against generalizing from a contest to the economy.
Source: Axios

Miss
Gary Marcus (AI researcher & critic)
“Deep learning is "hitting a wall" — scaling has run out and LLMs will stop improving.”
2022, restated through 2024–2025 · Marcus on AI
Resolution criteria
LLM capability visibly plateauing.
What actually happened
The claim was repeatedly followed by major gains (GPT-4, Claude, reasoning models, IMO gold). Marcus maintains that the deeper "reliable reasoning" critique still holds, a point that remains contested.
Postmortem
Included for balance: skeptics miss too. The strong "wall" framing did not hold, even as the narrower point — that models still reason unreliably — remains a live debate worth tracking.
Source: Marcus on AI