OpenAI’s announcement of GPT-5.5 Instant came with numbers designed for headlines: 52.5% fewer hallucinated claims than the previous Instant model on high-stakes prompts in medicine, law and finance, and a 37.3% reduction on conversations users had previously flagged for factual errors. Both numbers are based on internal evaluations.
That’s a credible engineering improvement and worth taking seriously. The temptation, particularly for any mid-market firm running an LLM in a regulated workflow, is to treat the announcement as a green light to widen the scope of deployment. The numbers say half the errors. So we get to do more with it. Right?
Wrong — and the reason is worth understanding before your next AI project review.
What “internal evaluations” can and cannot tell you
The first thing a vendor benchmark measures is the vendor’s benchmark. OpenAI doesn’t publish the specific prompts, the exact rubric, or the per-domain breakdown. They publish a percentage delta against another model on a private test set. That percentage is a real signal that GPT-5.5 Instant is meaningfully better than GPT-5.3 Instant on the things they chose to test. It is not a signal about how it will behave on the things you care about.
This isn’t cynicism. It’s how all model evaluation works. Hallucinations are not a single failure mode — they are a class of failures with different distributions across domains, prompt styles, user populations and jailbreak surfaces. A 52% reduction on OpenAI’s high-stakes test set may be a 70% reduction on your medical-claims summarisation use case and a 5% reduction on your insurance-policy explanation use case, and you have no way to know which without testing.
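To see why the aggregate can hide that spread, here is a toy calculation. Everything in it is invented for illustration: two hypothetical domains, made-up error rates and prompt counts. The point is only the arithmetic: a headline reduction near 52% is consistent with one domain improving dramatically and another barely moving.

```python
# Toy illustration: an aggregate hallucination reduction can mask very
# different per-domain movements. All rates and counts are invented.

domains = {
    # domain: (old error rate, new error rate, prompt count)
    "medical-claims summarisation": (0.140, 0.042, 1000),  # 70% better
    "insurance-policy explanation": (0.120, 0.114, 450),   # 5% better
}

old_errors = sum(old * n for old, _, n in domains.values())
new_errors = sum(new * n for _, new, n in domains.values())

for name, (old, new, _) in domains.items():
    print(f"{name}: {1 - new / old:.0%} reduction")
print(f"aggregate: {1 - new_errors / old_errors:.0%} reduction")
# prints 70%, 5%, and an aggregate of ~52%
```

Which side of that spread your workload lands on is exactly what a private vendor test set cannot tell you.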
What an honest evaluation looks like
For any AI system used in a regulated workflow, the only number that matters is the one you measured on your own ground truth. That’s not a hypothetical best practice — it’s an emerging audit expectation.
A serviceable mid-market eval has four ingredients (a minimal harness sketch follows the list):
- A representative prompt set drawn from real user interactions, sampled across the categories you actually deploy to. Twenty curated prompts beat two hundred synthetic ones. You’re trying to characterise reality, not pad a CSV.
- A ground-truth dataset with verified correct answers, built by domain experts inside your firm, not produced by grading one model’s output against another’s. This is the part most teams skip, and it’s the part that determines whether the eval is real.
- A deterministic rubric, meaning that for any given prompt a human grader (or a different model with a strict, narrow rubric) can decide whether the model output is correct, partially correct, hallucinated or refused, and that two graders agree most of the time.
- A rerun protocol triggered on every model upgrade. Upgrades that improve on the vendor’s eval can regress on yours, particularly for narrow domains. Treat eval reruns as an SLA.
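Here is a minimal sketch of a harness wiring those four ingredients together, in Python. Everything in it is an assumption for illustration: the record layout, the stubbed call_model function, and the keyword-based rubric, which stands in for whatever deterministic rules your domain experts actually write.

```python
"""Minimal eval-harness sketch covering the four ingredients above.
Illustrative only: the record layout, the stubbed model call and the
keyword rubric are assumptions, not any vendor's API."""

from collections import Counter
from enum import Enum

class Verdict(Enum):
    CORRECT = "correct"
    PARTIAL = "partially correct"
    HALLUCINATED = "hallucinated"
    REFUSED = "refused"

# Ingredients 1 and 2: prompts sampled from real traffic, paired with
# expert-verified reference fields. In practice these live in a
# version-controlled file; two toy records are inlined so the sketch runs.
RECORDS = [
    {
        "prompt": "Does policy X cover flood damage to outbuildings?",
        "required_facts": ["outbuildings are excluded"],
        "banned_claims": ["outbuildings are covered"],
        "refusal_markers": ["can't help", "cannot help"],
    },
    {
        "prompt": "Summarise the claim history for account 1042.",
        "required_facts": ["two claims", "2024"],
        "banned_claims": ["no claims on record"],
        "refusal_markers": ["can't help", "cannot help"],
    },
]

def call_model(prompt: str) -> str:
    # Stub so the file runs end to end; swap in your real model client.
    # Keeping it behind one function makes reruns on every upgrade
    # (ingredient 4) a one-line change.
    return "Per the policy schedule, outbuildings are excluded."

def grade(output: str, ref: dict) -> Verdict:
    # Ingredient 3: a deterministic rubric. No model-graded scoring, no
    # randomness; two graders applying these rules must agree.
    text = output.lower()
    if any(m in text for m in ref["refusal_markers"]):
        return Verdict.REFUSED
    if any(c in text for c in ref["banned_claims"]):
        return Verdict.HALLUCINATED
    hits = sum(f in text for f in ref["required_facts"])
    return Verdict.CORRECT if hits == len(ref["required_facts"]) else Verdict.PARTIAL

def run_eval(records: list[dict]) -> Counter:
    return Counter(grade(call_model(r["prompt"]), r) for r in records)

if __name__ == "__main__":
    # Ingredient 4: rerun on every model upgrade and diff against the
    # stored baseline before widening deployment scope.
    print(run_eval(RECORDS))
```

The deliberate design choice is that grade() contains no model calls and no randomness: rerunning it on stored outputs always produces the same verdicts, which is what makes grader agreement checkable and upgrade-to-upgrade comparisons meaningful.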
What the announcement does change
The 52% number is still useful — just not in the way the press release implies. Two practical reads:
- It’s a reasonable signal that this category of regulated-domain accuracy was actually a priority for the next training cycle. That tells you something about where OpenAI’s product roadmap is going, which informs vendor lock-in decisions.
- It’s a reasonable trigger to rerun your own eval. If OpenAI says hallucinations dropped, it’s worth checking whether your numbers move in the same direction (a minimal regression check is sketched below). If they do, scope expansion may be defensible. If they don’t, the announcement just told you something about how your domain differs from theirs — which is more valuable than the headline number.
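One way to make “moved in the same direction” concrete is a simple significance check over your reruns. The sketch below uses invented counts and a one-sided two-proportion z-test from the standard library; the test choice and the 0.05 threshold are assumptions, not a standard.

```python
# Hypothetical regression check after a vendor upgrade. Counts are
# invented; the statistical test and alpha are one reasonable choice.

from math import erf, sqrt

def hallucinations_dropped(old_bad: int, old_n: int,
                           new_bad: int, new_n: int,
                           alpha: float = 0.05) -> bool:
    """True if the new model hallucinates significantly less on your
    own eval set (one-sided two-proportion z-test)."""
    p_old, p_new = old_bad / old_n, new_bad / new_n
    pooled = (old_bad + new_bad) / (old_n + new_n)
    se = sqrt(pooled * (1 - pooled) * (1 / old_n + 1 / new_n))
    if se == 0:                    # both runs error-free: no evidence of change
        return False
    z = (p_old - p_new) / se       # positive z means fewer errors
    p_value = 1 - 0.5 * (1 + erf(z / sqrt(2)))
    return p_value < alpha

# Example: 31/200 outputs hallucinated before the upgrade, 18/200 after.
print(hallucinations_dropped(31, 200, 18, 200))  # True: defensible movement
```

With an eval set of only a few dozen prompts, even a real improvement may not clear this bar; that is an argument for growing the ground-truth set, not for trusting the vendor’s delta.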
The buyer’s read
Vendor accuracy claims are not a substitute for your evaluation regime; they are an input to it. Mid-market firms that build the eval discipline now will be in a position to make confident, defensible scope decisions every time a vendor announces an improvement. Firms that rely on the headline numbers will keep being surprised — sometimes pleasantly, sometimes in front of regulators.
The 52% is real. So is the gap between the test set OpenAI ran and the deployment you’re considering. Build the discipline that closes that gap.
Sources
- OpenAI claims ChatGPT’s new default model hallucinates way less — The Verge (accessed 2026-05-05)