
GPT-5.5 Is Here: What's Actually New, What's Hype, and How to Use It Well

OpenAI shipped GPT-5.5 on April 23, 2026 — six weeks after 5.4 and the first fully retrained base since GPT-4.5. It's a state-of-the-art agentic coder, a faster reasoner, and a 2× more expensive model with a real hallucination problem. Here's the honest field guide.


OpenAI shipped GPT-5.5 on April 23, 2026 — two days before this post went up. It's the first fully retrained base model since GPT-4.5; not a fine-tune, not a post-training pass on top of an existing checkpoint, but a from-scratch rebuild of the architecture, the corpus, and the objectives. It arrived just six weeks after GPT-5.4 dropped. That cadence is itself the story.

I've been pushing it through real workloads for the last 48 hours — coding agents, browser automation, long-context document work, content tasks, the kind of jobs we run constantly when analyzing landing pages at roast.page. This isn't a recap of OpenAI's launch deck. It's what the model actually does, where it earns its 2× price hike, where it doesn't, and how to get genuine work out of it.

If you're a founder, a builder, or anyone whose product depends on which frontier model is best right now, this is the field guide.

What 5.5 Actually Is (and Why "Fully Retrained" Matters)

OpenAI's last several models — 5.1, 5.2, 5.3, 5.4 — were post-training iterations on the same base. They got smarter, better-aligned, more efficient, but the foundation was unchanged. 5.5 is the first time since GPT-4.5 that OpenAI tore down the studs. New architecture. New pretraining corpus. New agent-oriented objectives baked in from the start, instead of bolted on later.

You feel this in two places. The first is in agentic loops — the model commits less to dead ends, recovers faster from tool errors, and chooses tools more decisively. The second is in instruction-handling: 5.5 reads multi-part instructions more like a senior engineer reads a ticket. It plans. It picks an order. It checks itself.

OpenAI president Greg Brockman called it "a new class of intelligence" that can "look at an unclear problem and figure out what needs to happen next" with minimal scaffolding. That's the kind of phrase that triggers eye-rolls because every model launch uses it. In 5.5's case there's a measurable behavior behind it: on the company's internal Expert-SWE benchmark — long-horizon coding tasks with a median estimated human completion time of 20 hours — 5.5 outperforms 5.4 by a meaningful margin.

But the rebuild also explains the price. Doubling input/output cost (more on that below) is the kind of thing you only ship after a foundation-level investment. OpenAI is signaling that the unit economics of the new base support a different ARPU than the old one did. Read into that what you will about the next twelve months.

The Benchmarks That Matter, in Plain English

Benchmarks are easy to cherry-pick. Here's the honest read on the ones that actually correspond to work you'd give a model.

| Benchmark | GPT-5.5 | GPT-5.4 | Claude Opus 4.7 | What it measures |
|---|---|---|---|---|
| Terminal-Bench 2.0 | 82.7% | 75.1% | 69.4% | Real command-line workflows: planning, iteration, tool coordination |
| GDPval | 84.9% | 83.0% | 80.3% | Knowledge work across 44 economically valuable occupations |
| OSWorld-Verified | 78.7% | 75.0% | 78.0% | Computer-use tasks: navigating real OS UIs to complete jobs |
| FrontierMath Tier 4 (Pro) | 39.6% | | 22.9% | Postdoctoral-level math problems |
| SWE-bench Pro | 58.6% | | 64.3% | Real GitHub issue resolution end-to-end |
| Toolathalon | 55.6% | | 54.6% | Agentic tool selection and use under uncertainty |
| BrowseComp (Pro) | 90.1% | | | Multi-hop research-grade web browsing |
| MMMLU (multilingual) | 83.2% | | 91.5% | Multilingual knowledge & reasoning |
| AA-Omniscience hallucination rate | 86% | | 36% | How often the model fabricates when it doesn't know (lower = better) |

Three takeaways from this table that the launch coverage mostly misses.

One — 5.5 is unambiguously the best agentic-loop model right now. Terminal-Bench, GDPval, OSWorld, BrowseComp, and Toolathalon are all tests of "given a goal and tools, can you actually finish the job?" 5.5 wins them all, often by wide margins over Claude Opus 4.7, which had been the consensus pick for agent work since mid-April.

Two — 5.5 is not the best at every kind of coding. SWE-bench Pro, which simulates resolving real GitHub issues, still goes to Claude Opus 4.7 by 5.7 points. If your team's daily work is "fix the bug in this PR," "add this feature to an existing codebase," or anything where the model is reasoning about a fixed code surface, Opus 4.7 is still the call. 5.5 wins when the work is browser-and-terminal-and-tools-shaped, not pure code-shaped.

Three — the hallucination number is a real problem and you have to know about it. On Artificial Analysis's Omniscience benchmark, 5.5 confabulates at an 86% rate when it doesn't know an answer. Opus 4.7 confabulates at 36%. Both numbers are higher than you'd hope, but 86% means that for any factual question the model can't answer from its training, you should assume it will produce something plausible-sounding and wrong. Use grounding (RAG, web search, tool calls) for any factual-recall task. Don't put 5.5 in a self-grading agent loop without an external verifier.

The hallucination caveat in plain language

5.5 is a brilliant doer and a confident bullshitter. The two traits are linked — agentic models are trained to commit and keep moving, which is exactly what makes them productive in tool-use loops and exactly what makes them confabulate when there's no ground truth. Plan your architecture around this: tools and citations for facts, the model for plans and execution.

The Two Variants and How to Choose

5.5 ships in two API endpoints and two corresponding ChatGPT tiers.

GPT-5.5 (Thinking) is the default. It replaced GPT-5.4 in ChatGPT for Plus, Pro, Business, and Enterprise users on April 23. The "Thinking" name refers to the adaptive reasoning baked into the new base — the model decides on the fly how much internal reasoning a problem requires. For most users this is the only variant they'll touch, and it's the right default for almost every workload.

GPT-5.5 Pro is the high-accuracy variant for harder problems. It's available to Pro, Business, and Enterprise tiers in ChatGPT, and through the API at six times the standard price. Pro is what scored 39.6% on FrontierMath Tier 4 and 90.1% on BrowseComp. It's clearly stronger on the hardest problems. It's also clearly overkill for the work most people use ChatGPT for, and using it indiscriminately will burn your budget faster than you can blink.

The clean rule: start with Thinking. Promote to Pro only when you have a specific named gap in Thinking's output that more reasoning would close. "Just in case" is not a reason. "The Thinking output got the structure right but missed the second-order implication of the legal clause" is a reason.

When to actually reach for Pro

  • Legal review where mis-reading a clause has real downside.
  • Multi-source financial analysis where the answer depends on synthesizing across documents.
  • Scientific research workflows — drug discovery, computational biology, anything where the model is helping reason through a real research problem.
  • Long-horizon agent loops where a single bad call early can poison the next two hours of work.
  • Math at the frontier — IMO-style problems, novel proofs, original derivations.

For the kinds of jobs most teams actually run — drafting copy, writing CRUD code, summarizing meetings, fielding support questions — Thinking is plenty and Pro is a waste.

The Pricing Is Doubled. Here's How Not to Pay Double.

API pricing for GPT-5.5 is $5 per million input tokens, $30 per million output tokens. That's exactly 2× GPT-5.4's $2.50/$15. Pro is $30/$180. These are the highest API prices OpenAI has ever charged for a flagship general-purpose model, and they're a deliberate signal: 5.5 is positioned as a productivity multiplier, not a commodity.

That said, you don't have to pay full freight. Three levers can claw most of the cost back, and one architectural choice can prevent it from mattering at all.

Lever 1: Cached input tokens are 90% off. Cached input is $0.50/M, not $5. If your prompt has a stable system block, codebase context, or document corpus that you pass on every request, prompt caching makes the marginal cost of repeat calls vanishingly small. This is the single highest-leverage pricing move and it's the one most teams forget.

Lever 2: Batch and Flex tiers offer 50% off. If your work isn't real-time — content generation queues, offline analytics, evals, backfills — push it through Batch. You give up latency guarantees and get half-price. Most "AI cost" surprises come from running async work on real-time pricing.

Lever 3: Choose the right variant. Pro at $30/$180 is six times Thinking. Using Pro indiscriminately is the easiest way to blow up an AI budget. Audit your workloads. The percentage that genuinely needs Pro is almost always under 10%.
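The three levers compound, and the arithmetic is worth internalizing. Here's a back-of-envelope cost model; the per-million rates are the ones quoted above, and the helper itself is an illustrative sketch, not an official calculator.

```python
# Back-of-envelope cost model for the three pricing levers.
# Rates are the ones quoted in this post; the helper is illustrative.

PRICES = {                       # $ per million tokens
    "gpt-5.5":     {"in": 5.00,  "out": 30.00},
    "gpt-5.5-pro": {"in": 30.00, "out": 180.00},
}

def request_cost(model, input_tokens, output_tokens,
                 cached_fraction=0.0, batch=False):
    """Estimated $ cost of one request.

    cached_fraction: share of input tokens served from the prompt cache
                     (cached input bills at 10% of the normal rate).
    batch: route through the Batch/Flex tier for a 50% discount.
    """
    p = PRICES[model]
    cached = input_tokens * cached_fraction
    fresh = input_tokens - cached
    cost = (fresh * p["in"] + cached * p["in"] * 0.10
            + output_tokens * p["out"]) / 1_000_000
    return cost * (0.5 if batch else 1.0)

# A 100K-token corpus prompt with 2K of output, asked repeatedly:
first_call = request_cost("gpt-5.5", 100_000, 2_000)                       # $0.56
repeat_call = request_cost("gpt-5.5", 100_000, 2_000, cached_fraction=0.95)  # $0.1325
```

The repeat call costs roughly a quarter of the first one, and that's before Batch. Run your own token counts through the same arithmetic before concluding that 5.5 is too expensive for your workload.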

The architectural choice: route by job complexity. Have a cheaper model — Sonnet 4.6, GPT-5.5-mini when it ships, or even open-source like DeepSeek V4-Pro — handle high-volume low-stakes work. Reserve 5.5 for the requests that actually need its capabilities. Most production systems should not route 100% of traffic to a frontier model.
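That routing layer can be very small. A minimal sketch, assuming you can tag each request with coarse complexity signals; the thresholds and model names here are placeholders you'd tune against your own traffic, not a recommendation.

```python
# Illustrative complexity router: cheap model for high-volume low-stakes
# work, frontier model for the rest. Thresholds and names are placeholders.

CHEAP_MODEL = "sonnet-4.6"      # any capable mid-tier model
FRONTIER_MODEL = "gpt-5.5"

def route(task: dict) -> str:
    """Pick a model based on coarse signals about the job."""
    needs_frontier = (
        task.get("tool_calls_expected", 0) > 3      # multi-tool agent loop
        or task.get("context_tokens", 0) > 200_000  # deep long-context work
        or task.get("stakes") == "high"             # legal, financial, etc.
    )
    return FRONTIER_MODEL if needs_frontier else CHEAP_MODEL
```

Even a crude router like this, put in front of 100% of traffic, usually reveals that most requests never needed the frontier model at all.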

A trade-off worth naming

OpenAI's claim is that 5.5 is "more token efficient" — meaning fewer output tokens to reach the same answer — and that on real workloads it ends up roughly cost-equivalent to 5.4 despite the 2× rate. This is plausible but unverified at scale. Don't take it on faith. Run your own representative slice through both models, measure tokens-per-task, and decide whether the efficiency claim holds for your work. The multiplier varies hugely by task type.

Tip 1: Stop Decomposing Tasks Manually

The biggest behavioral change between 5.4 and 5.5 is that the new model wants to do its own task decomposition. Give it a messy goal. It will plan. It will pick subtasks. It will sequence them. It will use tools to verify. This is the agentic objective that the rebuild was designed for.

If you're still writing prompts that read like "First, do X. Then, do Y. Then, do Z. After that, check W.", you're fighting the model. 5.4 needed that scaffolding. 5.5 doesn't, and adding it actively constrains the model's planner.

The new shape of a great prompt:

5.4-ERA PROMPT

"First, fetch the page. Then, extract the headlines. Then, classify each by tone. Then, group similar headlines together. Then, output a markdown report."

5.5-ERA PROMPT

"Analyze the headlines on this page and produce a markdown report grouping them by tone. Use whatever tools you need. Acceptance criteria: each group named, three to seven members per group, one-line description of why they cluster."

The 5.5-era version gives the model the goal, the constraints, and the definition of done. It lets the model own the path. Outputs are cleaner, latency is shorter, and you avoid the brittleness of micromanaging the planner.

The exception: if you have a hard procedural constraint — "you must call the auth tool before any other tool" — name it explicitly. The model will respect named constraints. What it dislikes is being told how to think when only the what matters.

Tip 2: Lean Into the 1M Context — But Structure It

The full 1M-token context window is a real, usable feature in 5.5, not a paper number. On the MRCR v2 retrieval benchmark, 5.5 hits 74% — meaning at the deep end of the window, it's still finding the right chunk most of the time. That's good enough that you can stop doing aggressive RAG chunking for medium-corpus problems.

But unstructured 1M-token dumps still hurt the model. Three rules from running real workloads through it:

Rule 1: Use hierarchical structure. Sections, sub-sections, clear delimiters. Wrap each document in a tag or a labeled block. The model is better at retrieving from 950K labeled tokens than from 600K unlabeled ones.

Rule 2: Put the question last. Long-context retrieval works better when the model knows what it's looking for at the moment it scans. Put your corpus first, then the prompt.

Rule 3: Cache the corpus. If you're going to ask many questions of the same long context, cache it. The 90% cached-input discount turns a $5/call corpus into a $0.50/call corpus, and the latency on cached reads drops by an order of magnitude. This single move makes long-context workflows practical that would otherwise be theoretical.
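All three rules can be folded into one prompt-assembly helper. A minimal sketch; the `<document>` tag format is one workable labeling convention, not a required one.

```python
# Sketch of the three long-context rules: labeled document blocks (Rule 1),
# corpus first and question last (Rule 2). The tag format is illustrative.

def build_long_context_prompt(documents: dict, question: str) -> str:
    parts = []
    for name, text in documents.items():
        # Rule 1: wrap each document in a clearly labeled block.
        parts.append(f"<document name={name!r}>\n{text}\n</document>")
    # Rule 2: the question goes after the corpus, not before it.
    parts.append(f"Question: {question}")
    return "\n\n".join(parts)
```

Rule 3 falls out of keeping the serialization stable: if the corpus portion is byte-identical across calls (same document order, same tags), the whole prefix is cacheable and the 90% discount applies to it.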

Codex's 400K context is smaller than the API's 1M, but it's the right tradeoff for code: throughput and cost matter more than depth in most coding workflows, and 400K still covers most repositories.

Tip 3: Use Codex Fast Mode for Iterative Loops, Not for Final Drafts

Codex with 5.5 has a new Fast mode: 1.5× token-generation speed at 2.5× the cost. That cost-per-token is brutal on a per-call basis, but the math gets interesting in tight iteration loops.

If you're refactoring a function and running tests every 30 seconds, the latency reduction in Fast mode often pays for itself in saved engineer-time. If you're generating a one-shot 10K-token document, Fast mode is just expensive. The dial is "how often will I throw away the output and try again?" — high iteration count means Fast pays back, low iteration count means standard.

Practical rule: I leave Fast on for live debugging, exploratory analysis, and any chat where I'm the bottleneck. I switch off Fast for batch jobs, scheduled runs, and final-output generation.
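The dial can be made concrete. Here's a rough break-even sketch: the 1.5× and 2.5× multipliers are the Fast-mode figures above, and everything else (token counts, iteration time, hourly rate) is an assumption you should replace with your own numbers.

```python
# Back-of-envelope: when does Fast mode's 1.5x speed pay for its 2.5x
# token price? The multipliers come from the pricing above; all the
# arguments are assumptions to replace with your own measurements.

def fast_mode_pays(tokens_per_iteration, iterations, base_cost_per_mtok,
                   seconds_per_iteration, engineer_rate_per_hour):
    base_cost = iterations * tokens_per_iteration * base_cost_per_mtok / 1e6
    fast_extra_cost = base_cost * (2.5 - 1.0)          # marginal token cost
    time_saved_s = iterations * seconds_per_iteration * (1 - 1 / 1.5)
    time_saved_value = time_saved_s / 3600 * engineer_rate_per_hour
    return time_saved_value > fast_extra_cost

# Tight debug loop (60 short iterations): Fast pays.
# One-shot 100K-token generation: Fast is just expensive.
```

The shape of the result matches the rule above: high iteration counts with a human waiting tilt toward Fast, and one-shot generation never does.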

Tip 4: Don't Trust Self-Verification Alone

5.5 self-verifies natively. The model audits its own work before responding — the same architectural feature that pushed Terminal-Bench to 82.7%. This is great in agentic loops, but it does not compensate for the hallucination problem.

A self-verifying model can confidently report "yes, I checked, the answer is X" while X is wrong. The verification pass uses the same model that produced the answer; both passes share the same knowledge gap. If 5.5 doesn't know something, both the answer and the verification will be hallucinated together.

The defense is not "ask the model to double-check." The defense is external grounding:

  • For factual claims — call a search tool, cite the source URL, require the model to quote the source.
  • For code — run the code. Run the tests. Don't ship anything that didn't actually execute.
  • For data analysis — read the file with deterministic tools (pandas, jq) before asking the model to interpret. The model is a great interpreter; it's a poor scanner.
  • For research synthesis — require inline citations and verify a random sample by hand.

The agentic-and-confident architecture means 5.5 will do all of this if you ask. It won't do it if you don't. Asking is on you.
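One cheap external check that enforces the grounded shape: refuse any "factual" answer that doesn't quote a passage and link a source. This doesn't verify the fact itself; it forces the output into a form a human (or a follow-up fetch step) can verify. The pattern is illustrative.

```python
# Minimal gate for factual-recall outputs: require a source URL and a
# quoted passage before accepting the answer. Illustrative, not a real
# verifier -- it checks shape, not truth.

import re

URL_RE = re.compile(r"https?://\S+")

def grounded_enough(answer: str) -> bool:
    """Reject answers that carry no source URL or no quoted passage."""
    has_url = bool(URL_RE.search(answer))
    has_quote = '"' in answer or "\u201c" in answer  # straight or curly quote
    return has_url and has_quote
```

A gate like this sits naturally at the end of an agent loop: answers that fail it get retried with an explicit "search and cite" instruction instead of being passed through.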

Tip 5: The Browser Use Is Genuinely Different Now

Codex with 5.5 can interact with web apps end-to-end: click pages, fill forms, run flows, capture screenshots, evaluate the result. The OSWorld 78.7% number is the headline; in practice the unlock is that you can describe a browser task in natural language and the model will actually finish it.

Examples that work today, in production, that did not work reliably on 5.4:

  • "Log into our admin panel, find users created in the last week, export them as CSV, and email me the file."
  • "Open three competitor pricing pages, take screenshots of each, and produce a comparison table."
  • "Walk through our signup flow as if you were a new user, screenshot any step that takes more than two clicks, and write up what felt confusing."

That third example is, not coincidentally, exactly the kind of work we run constantly when tearing down landing pages at roast.page. 5.5 can do a credible first pass at it autonomously now. It can't replace human judgment on craft and copy — but it can do the legwork that used to consume the first hour of the audit.

Two practical notes on browser use. First, give it explicit credentials and explicit termination conditions ("log out and close the browser when done"). The model is good at the task and bad at remembering housekeeping. Second, screenshot-driven verification is now reliable: ask the model to capture a screenshot at each major step, and review them. Many "the agent claimed it succeeded but didn't" failures are caught by a single screenshot check.

Tip 6: For Marketers and Builders, the Real Unlock Is Multi-Tool Workflows

If you're not building agents and you're not coding, you might be wondering what 5.5 actually does for you. Honest answer: as a chat partner, it's incrementally better than 5.4 — clearer, faster, less verbose. But as a multi-tool worker, it's a step change.

The shape of work that lights up under 5.5 looks like this:

A workflow 5.5 handles end-to-end

Goal: Audit our top 5 competitor pages, extract their pricing pages and headline copy, run our own page through a comparison, and produce a slide deck summarizing what we should change.

Tools needed: browser, file system, slide generator.

Old workflow: 4 hours of analyst time. 5.5 workflow: 30 minutes of model time + 30 minutes of human review.

That's not a marketing claim — it's a measured workload from this week. The reason it works isn't that 5.5 is "smarter" than 5.4 in the colloquial sense. It's that 5.5 holds a multi-step plan in mind across tool calls without losing track. The previous generation could do any individual step. It would lose the thread between steps.

For marketers and growth folks specifically, the workloads where this matters most:

  • Competitive teardowns across multiple pages.
  • Cohort-by-cohort conversion analysis where the model pulls from analytics, runs queries, and synthesizes a narrative.
  • Content production pipelines that touch a brief, a research step, a draft, an SEO pass, and a publish step.
  • Customer interview synthesis where transcripts, segmentation logic, and clustering all need to play together.

If your work doesn't touch multiple tools, 5.5 is a marginal upgrade. If it does, it's the first model that finishes the job without a human stitching the steps together.

Tip 7: Stop Asking 5.5 to Recall Facts. Start Asking It to Find Them.

This is downstream of the hallucination problem, but it deserves its own tip because so many people get it wrong.

5.5 will confidently tell you that a SaaS company was founded in 2018 when it was founded in 2014. It will tell you that a feature exists when it doesn't. It will fabricate a study citation that has the right author and the wrong title. The 86% Omniscience hallucination rate is not a hypothetical — it's a description of what happens when you ask the model what it doesn't know.

The right move is to never ask 5.5 to recall a fact you care about. Always have it find the fact instead.

CONFABULATION-PRONE

"What was the conversion rate Linear reported in their 2025 growth post?"

GROUNDED

"Search for Linear's 2025 growth post, find the conversion rate they reported, quote the sentence, and link the source."

Both prompts are one sentence. Both produce a number. The first is pure recall and the model will guess; the second is a tool-grounded retrieval and the model will succeed (or fail honestly when the source doesn't exist). You haven't added effort. You've changed the model's task from "remember" to "look up," which is a task it's much better at.
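The rewrite is mechanical enough to template. A sketch with one workable phrasing; the wording is mine, not an official pattern.

```python
# Recall-to-lookup rewrite as a template. The phrasing is one workable
# version, not canonical -- the point is forcing "look up" over "remember".

def grounded_prompt(question: str) -> str:
    return (
        f"Search the web to answer: {question} "
        "Quote the exact sentence that contains the answer, link the "
        "source URL, and say 'not found' if no source confirms it."
    )
```

The "say 'not found'" clause matters as much as the search instruction: it gives the model an honest exit instead of leaving confabulation as the only way to complete the task.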

Tip 8: Use the New Codex Browser as a Test Harness

One of the quietly powerful additions in 5.5: Codex's expanded browser tool can run a flow you describe and report back on what happened. This is great for testing.

Examples I've used in the last week:

  • "Sign up as a new user, then immediately try to upgrade. Tell me where the friction is."
  • "Open our pricing page on a 4G simulated connection. Tell me what loads after 3 seconds."
  • "Click every CTA on our landing page. Tell me which ones go to a 404 or a slow page."

This isn't QA replacement — automated test suites are still better at deterministic regressions. It's exploration. It's the kind of "go try the product like a user and tell me what's weird" task that used to require a person and now doesn't. For pre-launch checks on landing pages, it pairs well with a structured analysis pass through roast.page: the page-level critique catches the messaging and conversion issues, and the Codex browser pass catches the flow-level friction.

Tip 9: Build Around the Token-Efficiency Claim, Don't Bet on It

OpenAI's framing on 5.5 is that the model uses fewer output tokens to reach the same answer — which is meant to soften the 2× per-token rate hike. The framing is plausible. The actual savings are highly workload-dependent.

For agentic loops with lots of tool calls, 5.5's better planning genuinely cuts total tokens — fewer false starts, fewer "let me try again" cycles. We're seeing 15–30% reductions in total tokens-per-completed-task on agentic workloads. That clawback partially offsets the rate hike.

For chat-style work with single-turn responses, the savings are much smaller — sometimes negligible. A summary task that took 1,500 output tokens on 5.4 might take 1,400 on 5.5. That's not a meaningful efficiency gain at 2× the rate.

The build-around move: measure your real workloads in both models, on a representative sample, before you migrate fully. If your workload sees the agentic efficiency lift, the rate hike is roughly a wash. If it doesn't, you're paying double for incremental quality. The decision should be data-driven, not vibes-driven.
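The measurement itself is a short harness. A sketch, where `run_task` stands in for your actual model call and must report completion plus token counts; the price table mirrors the figures quoted earlier in this post.

```python
# Sketch of the pre-migration measurement: run the same task sample through
# both models and compare cost per completed task. `run_task(model, task)`
# is your own adapter and must return (completed: bool, in_toks, out_toks).

def compare_models(tasks, run_task, models, prices):
    """prices: {model: (input_$_per_mtok, output_$_per_mtok)}"""
    report = {}
    for model in models:
        cost = 0.0
        completed = 0
        p_in, p_out = prices[model]
        for task in tasks:
            ok, in_toks, out_toks = run_task(model, task)
            cost += (in_toks * p_in + out_toks * p_out) / 1e6
            completed += ok
        report[model] = {
            "completed": completed,
            "cost_per_completed_task": cost / max(completed, 1),
        }
    return report
```

If 5.5's tokens-per-task reduction on your sample is big enough, its cost-per-completed-task lands near 5.4's despite the 2× rates; if not, the harness tells you before the invoice does.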

What This Means for OpenAI

The cadence here is what you should pay attention to. Six weeks between GPT-5.4 and GPT-5.5. That caps a run of releases: one in December 2025, then March 2026 (5.4), then April 2026 (5.5). OpenAI is shipping major model upgrades every ~6 weeks, plus point releases in between.

This is unusual. Until late 2025, frontier model releases came roughly every six months. The shift to a six-week cycle is a deliberate competitive response to Anthropic, which has been releasing Claude updates on a similarly compressed schedule. The narrative on social media in March was that OpenAI had "lost traction" against Anthropic in enterprise. The April release is, in part, a counter-narrative.

The numbers OpenAI dropped alongside the launch make the strategy explicit: 900 million weekly active ChatGPT users, 50 million paid subscribers, 9 million paying business users, 4 million Codex users. Those are the largest reach numbers in the AI industry. The "super app" framing — where ChatGPT, Codex, and the AI browser merge into one productized surface — is OpenAI's argument that the moat isn't the model, it's the surface area of integrated products. Each model improvement upgrades the entire surface at once.

For builders, the practical implication is that OpenAI's model APIs are now pricing themselves like premium products, not commodities. OpenAI's bet is that the agentic capability gap is wide enough to justify a 2× price hike and the user base is sticky enough to absorb it. Whether that holds depends on what Anthropic ships next — Claude Opus 4.7 was a strong response, and the rumored Mythos model lurking in benchmark traces is a known unknown.

Either way, the model layer of the AI stack is moving from "biggest model wins" to "best agentic surface wins." 5.5 is the strongest signal yet that OpenAI knows it.

What This Means for Builders and Marketers

Three implications worth taking seriously.

1. The "AI agent" use case is real now, in a way it wasn't six months ago. Multi-step workflows that touch a browser, a file system, and a few APIs are no longer research projects — they're production-deployable with 5.5 as the engine. If you've been on the fence about whether to build an agentic feature into your product, the model is now good enough to lean into.

2. AI traffic to your landing page is going to keep behaving differently from Google traffic. 5.5's improved browser and BrowseComp scores mean ChatGPT and Perplexity are getting better at finding and citing pages. The companies that design pages for AI visitors — pricing surfaced, specifics front-loaded, deep links in place — will keep capturing the disproportionate value in that traffic. The companies that don't will keep leaking it. The gap between the two strategies is widening, not closing.

3. Your AI cost structure should now route by job complexity. Sending all traffic to 5.5 is the easiest way to triple your AI bill in a quarter. Cheaper, smaller models — including OpenAI's own future 5.5-mini, Sonnet 4.6, and the better open-source options — should handle the bulk of low-stakes work. 5.5 should get the calls that need its capabilities. Most teams are routing too aggressively to the frontier and paying for it.

A side note for landing page operators

If you're shipping pages that AI agents will read and cite, 5.5's improved retrieval and reasoning means the agent is going to be more demanding about page quality. Vague hero copy gets summarized as vague. Missing pricing gets summarized as "pricing not specified." The standards just went up. Run your page through roast.page if you want a structured read on whether an AI agent (or a human) can extract what you actually do from above the fold.

The Honest Bottom Line

GPT-5.5 is the strongest agentic model available right now, and it's not particularly close on the benchmarks that measure end-to-end task completion. It's also more expensive, more confident in its hallucinations, and not always the best for pure code work. The picture is more nuanced than "OpenAI ships frontier model, hooray."

If you build agents, automate workflows, run multi-tool research, or do anything where the bottleneck is the model holding a plan together across steps, 5.5 is now the default. Migrate. Eat the price hike. Lean into the 1M context and the new browser. Cache aggressively to claw back cost.

If your workload is single-turn chat, factual recall, or vanilla bug-fixing in existing codebases, 5.5 is an incremental upgrade — possibly not worth the 2× cost. Stay on 5.4 or look at Claude Opus 4.7 depending on the task shape. There's no shame in not using the newest model. The shame is in paying frontier prices for non-frontier work.

The longer-term takeaway: model releases now happen on a cadence that makes "which model do I use?" a question worth re-asking every six weeks. The right move isn't to pick a permanent winner. It's to build your stack so you can swap models cheaply — abstracted prompts, isolated routing logic, regular evals against your own real workloads. The model layer is becoming the most volatile part of the AI stack, and the teams that treat it that way will outpace the teams that pick a horse and ride it.

Six weeks from now, there will be another release. Probably from OpenAI again, possibly from Anthropic, possibly from someone we haven't been watching closely enough. The model that matters most to your business in July 2026 may not exist yet today. Build for that reality.

For now, GPT-5.5 is the model to know. Use it well, watch its hallucinations, and don't pay for Pro when Thinking will do.

Tags: GPT-5.5, OpenAI, ChatGPT, Codex, AI prompting, agentic AI, AI for marketers, prompt engineering

Curious how your landing page scores?

Get a free, specific analysis across all 8 dimensions.

Analyze your page for free →
