AI Reality Check: 10 claims graded

AI Reality Check. Claims, graded against reality.

We track claims made by AI labs, vendors, executives, and analysts, then grade them against what actually happened — with sources. It is not a record of AI4C predictions; we don't make them.

Hit: 2
Partial: 4
Miss: 4

Hit: Gartner (Rita Sallam)

At least 30% of generative AI projects will be abandoned after proof of concept by the end of 2025.

Horizon

End of 2025

Resolution criteria

At least 30% of enterprise GenAI proofs-of-concept abandoned, per industry surveys.

What actually happened

Met, and likely understated: by the end of 2025, multiple analyses put the share of abandoned proofs of concept in the 30–50% range.

Postmortem

A rare bearish forecast from a firm that usually sells optimism — and it landed. The lesson for leaders: a working pilot is not a working product; budget for the gap between the two.

Source: Gartner
Miss: Dario Amodei (Anthropic CEO)

AI will be writing 90% of the code in three to six months.

Horizon

~September 2025

Resolution criteria

AI authoring roughly 90% of code across the industry, or at least on most teams.

What actually happened

Did not happen industry-wide by September 2025. Amodei later reframed it as true "on many teams" inside Anthropic — not the broad claim as originally stated.

Postmortem

The capability is advancing fast, but the headline number and timeline were a leader talking his own book. Watch the pattern: a bold public number, then a quiet internal reframing.

Source: Yahoo Finance
Partial: Sam Altman (OpenAI CEO)

In 2025 we may see the first AI agents join the workforce and materially change the output of companies.

Horizon

2025

Resolution criteria

AI agents in real production use, materially changing company output.

What actually happened

Agent products shipped (Operator, Copilot, Claude tools) and entered pilots, but "materially change the output of companies" stayed aspirational — most deployments remained experimental.

Postmortem

The tools arrived on schedule; the impact did not. The gap between "agents exist" and "agents move the P&L" is exactly what enterprise pilot data later exposed.

Source: Axios
Miss: Elon Musk

We will have AI that is smarter than any one human probably around the end of 2025.

Horizon

End of 2025

Resolution criteria

A single AI system broadly smarter than the smartest human.

What actually happened

No such system by end of 2025. Critics publicly offered a $1M (raised to $10M) bet against it; the offer went untaken.

Postmortem

Models hit superhuman marks on narrow tests (see the Math Olympiad entry) while staying far from "smarter than any human" in general. Be wary of single-number AGI timelines — especially from people selling a model.

Source: Fortune
Miss: Elon Musk (Tesla)

Unsupervised Full Self-Driving and a robotaxi network across 8–10 metros, reaching half the US population, by the end of 2025.

Horizon

End of 2025

Resolution criteria

Driverless robotaxis at the promised scale; unsupervised FSD for owners.

What actually happened

A robotaxi pilot launched in Austin in mid-2025 with human safety drivers. The promised multi-metro scale and unsupervised FSD did not arrive; the goal slipped to "widespread by end of 2026."

Postmortem

A near-annual pattern: a concrete, dated autonomy promise that slips by roughly a year. Useful for calibrating any "next year" self-driving claim.

Source: InsideEVs
Partial: Cognition

Devin, "the first AI software engineer" — a new state-of-the-art that passed real engineering interviews and completed paid Upwork jobs.

Horizon

At launch

Resolution criteria

Autonomous, end-to-end software engineering at the implied human level.

What actually happened

The headline benchmark (13.86% on SWE-bench) ran under favorable conditions; independent real-world completion landed near that figure, and stronger models surpassed it within a year. Impressive, but well short of "an AI software engineer."

Postmortem

A masterclass in benchmark framing. When a demo claims to replace a human job, ask for the test conditions before you trust the headline number.

Source: Independent review
Partial: Klarna (CEO Sebastian Siemiatkowski)

Klarna’s OpenAI-powered assistant does the work of ~700 customer-service agents — AI can run the function.

Horizon

2024–2025

Resolution criteria

AI sustainably replacing the human customer-service function.

What actually happened

Mixed. Klarna walked back the all-in approach in 2025 ("we went too far") and rehired humans for quality, while still reporting that the AI did the work of ~853 agents and saved ~$60M.

Postmortem

Both stories are true: real efficiency, and a quality/trust ceiling that forced humans back in. The honest read is "AI plus humans," not "AI instead of humans."

Source: Entrepreneur
Partial: Mark Zuckerberg (Meta CEO)

In 2025, Meta and others will have an AI that can effectively be a mid-level engineer that writes code.

Horizon

2025

Resolution criteria

AI performing at a mid-level engineer’s level and replacing those roles.

What actually happened

Partly. AI coding tools became genuinely strong and reshaped hiring (fewer junior/mid backfills), but wholesale replacement of mid-level engineers did not occur — the role shifted toward AI-augmented engineers.

Postmortem

The capability narrowed the gap; the org change was "hire differently," not "replace." The durable signal is a changed hiring mix, not a headcount wipeout.

Source: Entrepreneur
Hit: OpenAI & Google DeepMind

Reasoning models will reach elite, gold-medal-level performance on competition mathematics.

Horizon

2025

Resolution criteria

Gold-medal-level IMO performance by a general-purpose model.

What actually happened

Achieved: both reached gold-medal level (35/42 points, 5 of 6 problems) with general-purpose reasoning models; DeepMind’s entry was officially graded.

Postmortem

A genuine, verifiable leap — and a counterweight to the misses: narrow, well-defined reasoning is advancing faster than the broad "AGI" and jobs claims. The caution is generalizing from a contest to the economy.

Source: Axios
Miss: Gary Marcus (AI researcher & critic)

Deep learning is "hitting a wall" — scaling has run out and LLMs will stop improving.

Horizon

Ongoing

Resolution criteria

LLM capability visibly plateauing.

What actually happened

Repeatedly followed by major gains (GPT-4, Claude, reasoning models, IMO gold). Marcus maintains the deeper "reliable reasoning" critique still holds — contested.

Postmortem

Included for balance: skeptics miss too. The strong "wall" framing did not hold, even as the narrower point — that models still reason unreliably — remains a live debate worth tracking.

Source: Marcus on AI

The grading process is public by design.

Method

01. Capture the claim

We preserve the wording, source, date, speaker, and audience instead of paraphrasing a bold forecast into something easier to grade later.

02. Define the test

Each entry needs a time horizon and resolution criteria before it can appear publicly. Ambiguous claims stay out of the tracker.

03. Check against evidence

Resolved claims are compared with public filings, product releases, adoption data, regulatory outcomes, and contemporaneous reporting.

04. Publish the grade

The result gets a short postmortem that explains what the industry got right, what it missed, and what readers should watch next.

Cadence

New claims added as they resolve.

We add 2–4 entries a month as forecasts mature, product promises come due, and adoption claims meet public evidence. Spot a claim we should grade? Tell us.