GPT 5.5 vs Opus 4.7: I Tested Both. Here's What Won.

The tab said 250,000 output tokens. I refreshed the dashboard twice. Same number.

Opus 4.7 had just finished building a solar system simulation in twelve minutes. Same prompt, same task, GPT 5.5 had finished the same job in ten minutes flat — and used 70,000 output tokens. Not 200,000. Not 150,000. Seventy thousand. A 3.5x difference for output that, on the surface, looked nearly identical. Both had clickable planets. Both had adjustable orbital speeds. Both rendered without errors on the first run.

That's the moment I realized the GPT 5.5 vs Opus 4.7 conversation isn't going to be settled by benchmark scores or marketing decks. It's going to be settled by token bills and runtime clocks. And the answer everyone is dancing around is more interesting than "X is better than Y" — because it depends entirely on what you're actually trying to ship.

I spent the last week running both models through four production-style builds: a personal brand site, a solar system simulator, a 3D space shooter, and an ecosystem simulation that genuinely stress-tests one-shot reasoning. Real prompts. Real cost tracking. Real output you can run. This post is what I learned, what surprised me, and which model I'd actually reach for tomorrow morning if I had to ship something by lunch.

Stay with me to the third experiment. That's where my assumption about which model "wins" got demolished in real time.

Why This Comparison Actually Matters Right Now

The release cadence has gone insane. GPT 5.4 dropped in February. Opus 4.7 landed shortly after. Now GPT 5.5 — codenamed "Spud" in the leaks — arrives roughly six weeks after its predecessor with a price hike, a token-efficiency pitch, and a benchmark sheet that puts it ahead of Opus 4.7 on Terminal Bench 2.0 by 13.3 points (82.7 vs 69.4).

If you're a solo developer or running an agent stack that bills by the token, every model swap forces a unit-economics review. Doubling your input price from $2.50 to $5.00 per million tokens isn't a footnote; it's a line item that hits your monthly bill. The pitch from OpenAI is that GPT 5.5 uses fewer output tokens per task, so the per-task cost evens out or comes in lower. That's the claim. I wanted to test it on real work, not synthetic benchmarks.

Here's what makes this round different from the GPT 5.4 vs Opus 4.6 comparison I ran a few months back. GPT 5.5 isn't pitching raw intelligence anymore. It's pitching autonomous decomposition — the ability to take a vague prompt, identify the ambiguities, and execute the next steps without coming back to ask you fifteen clarifying questions. That's a behavior change, not a capability change. And behavior changes are notoriously hard to evaluate from a press release.

If you've been on Claude Code for the last six months, you know Opus 4.7 has its own personality. It writes more verbose code. It explains itself constantly. It produces output that's often visually polished but token-expensive. The question I went into this week with: does GPT 5.5's "more with less" actually deliver, or is it a marketing slogan stretched over a 2x price hike?

But before I share the results, you need to know how I structured the test — because the headline numbers don't mean what they look like at first glance.

The Test Setup: Four Builds, Identical Prompts, Honest Math

Every comparison test I've ever read suffers from the same problem: cherry-picked tasks. Someone runs three prompts, posts the screenshots that flatter their preferred model, and calls it a comparison. I wanted to do this differently.

I picked four tasks that span the actual range of what I build with these models on a normal week:

A personal brand website — design-heavy, judgment-driven, multiple iterations expected
A solar system simulation — physics, animation, math, interactive controls
A 3D space shooter game — gameplay logic, controls feel, real-time rendering
An ecosystem simulation — complex emergent behavior, the hardest one-shot

For each task, I wrote one prompt. Same prompt for both models. No iterating, no follow-ups, no "actually, can you fix this." Just the prompt, the output, and the bill.

I tracked four numbers on every run:

Runtime: wall-clock time from prompt submission to final output
Input tokens: what the model consumed (prompts + retrieved context)
Output tokens: what the model wrote back
Estimated cost: input × $5/M + output × $30/M for GPT 5.5, input × $5/M + output × $25/M for Opus 4.7

Pricing reference for context: GPT 5.4 ran at $2.50/$15 per million input/output tokens. GPT 5.5 ships at $5.00/$30, exactly double. Opus 4.7 sits around $5.00/$25, which makes its output tokens $5 cheaper per million than GPT 5.5's. So if GPT 5.5's pitch about token efficiency is real, it has to make up that pricing delta with significantly fewer output tokens. The math is brutal and honest.
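To make the pricing delta concrete, here is a minimal sketch of the cost formula from the list above, using the per-million prices quoted in this post. The function names and the model labels used as dictionary keys are illustrative only (they are not API model identifiers), and the break-even figure assumes both models consume the same number of input tokens, which is a simplification.

```python
# Prices in USD per million tokens, as quoted in this post.
# The keys are plain labels for illustration, not real API model IDs.
PRICES = {
    "gpt-5.5": {"input": 5.00, "output": 30.00},
    "opus-4.7": {"input": 5.00, "output": 25.00},
}


def estimated_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated cost in USD: input x input price + output x output price."""
    price = PRICES[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000


def break_even_output_ratio() -> float:
    """Fraction of Opus 4.7's output tokens that GPT 5.5 can emit and still
    tie on output cost, assuming identical input usage (a simplification)."""
    return PRICES["opus-4.7"]["output"] / PRICES["gpt-5.5"]["output"]


if __name__ == "__main__":
    print(f"Break-even output ratio: {break_even_output_ratio():.3f}")
```

Under that equal-input assumption, GPT 5.5 has to write roughly 17% fewer output tokens than Opus 4.7 just to tie on cost. Anything beyond that is the actual saving.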

One more setup note. GPT 5.5 ran through Codex (OpenAI's coding harness with tool-calling, multi-agent execution, and the new reusable workflows). Opus 4.7 ran through Claude Code. Both are the canonical environments for their respective models. Comparing the bare APIs would have been unfair to both — these models are designed to be used inside their harnesses.

Now, the results. Let's start with the build where the gap was the most embarrassing.

Experiment 1: The Personal Brand Website Build

The prompt: "Build a dynamic personal brand website with an animated hero section, a verified work history with context maps, a project showcase, and a contact form. Use modern design conventions. Make it production-ready."

Vague on purpose. This is exactly the kind of prompt where autonomous decomposition is supposed to shine — or fall apart.

GPT 5.5 (Codex) finished in roughly four minutes. It produced a site with verification loops on the work history (clicking each role expanded a context map showing related projects), an interactive timeline, and a contact form with proper validation. The design wasn't going to win awards, but it was coherent and shippable. Estimated cost: about $1.

Opus 4.7 (Claude Code) finished in roughly fourteen minutes. The site was prettier — smoother animations, a nicer typographic hierarchy, more deliberate color choices. But there were minor UI bugs: a hover state that didn't unstick, a contact form button that overlapped its container on mobile. Fixable, but real. Estimated cost: about $5.

A 3.5x runtime difference. A 5x cost difference. For output that, after fifteen minutes of QA, was roughly equivalent in shippability.

Here's the part that surprised me. GPT 5.5 used dramatically fewer output tokens not because it wrote less code, but because it wrote tighter code. Looking at the diffs side by side, Opus 4.7 generated more comments, more whitespace, and more "explanation" inside the code itself. GPT 5.5's output read like a senior engineer who's been told to optimize for clarity over self-documentation. Less prose, more signal.

I'd been assuming "fewer output tokens" meant "less detailed output." It doesn't. It means different code style, with the same functional surface area. That distinction matters because it means GPT 5.5's efficiency isn't coming at the cost of completeness — it's coming from compression.

Pro tip: if you're billing clients by the hour for AI-assisted builds, this single experiment justified my Codex subscription for the month. Three and a half times faster on bread-and-butter work is not a benchmark improvement. It's a workflow change.

But that was the easy build. The next one had a twist.

Experiment 2: Solar System Simulation — Where Opus Fought Back

The prompt: "Build an interactive solar system simulation with adjustable orbital speed, clickable planets that show information panels, and accurate relative scaling."

This is where I expected GPT 5.5 to win again on speed. It did. Marginally. Maybe 30 seconds faster on a roughly 8-minute task.

But the breakdown got interesting fast.

Metric	GPT 5.5	Opus 4.7
Runtime	~7m 30s	~8m
Input tokens	More than Opus	Less
Output tokens	Fewer than Opus	More
Cost	Higher by ~$1	Lower
Visual quality	Functional	Better proportions

GPT 5.5 ate more input tokens because Codex's tool-calling fanned out — multiple file reads, multiple search calls, more context shuffling. Opus 4.7's harness was more conservative on retrieval. The result: GPT 5.5 was slightly faster but slightly more expensive, and Opus 4.7's output had visibly better orbital proportions and a more pleasing color palette.

This is the experiment that broke my "GPT 5.5 wins on cost" thesis. Output tokens are only half the bill. Input tokens count too, and Codex's more aggressive tool use means GPT 5.5 can lose on input cost even when it wins on output. The total economics depend on the harness, not just the model.
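A rough sketch of how the input side can flip the result. The token counts below are hypothetical, chosen only to illustrate the mechanism; they are not measured figures from this run. Prices are the per-million rates quoted earlier, and the helper names are made up for this example.

```python
# USD per million tokens, as quoted earlier in this post.
GPT_IN, GPT_OUT = 5.00, 30.00
OPUS_IN, OPUS_OUT = 5.00, 25.00


def cost(input_tokens: int, output_tokens: int, in_price: float, out_price: float) -> float:
    """Estimated run cost in USD."""
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


def extra_input_budget(gpt_output: int, opus_output: int) -> float:
    """Extra input tokens GPT 5.5 can consume, beyond Opus 4.7's input,
    before its output savings are wiped out. Valid because both input
    prices are equal in this comparison."""
    output_savings_usd = (opus_output * OPUS_OUT - gpt_output * GPT_OUT) / 1_000_000
    return output_savings_usd / GPT_IN * 1_000_000


# Hypothetical run: GPT 5.5 writes far less but reads far more.
gpt_cost = cost(900_000, 20_000, GPT_IN, GPT_OUT)
opus_cost = cost(600_000, 60_000, OPUS_IN, OPUS_OUT)
budget = extra_input_budget(20_000, 60_000)

print(f"GPT 5.5: ${gpt_cost:.2f}, Opus 4.7: ${opus_cost:.2f}")
print(f"Extra input GPT 5.5 could afford: {budget:,.0f} tokens")
```

In this made-up case the output savings buy about 180,000 extra input tokens. The hypothetical GPT 5.5 run reads 300,000 more than Opus 4.7, so it ends up the more expensive of the two despite writing a third of the output.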

If I had to ship a solar system widget for a client, I'd reach for Opus 4.7's version. It looked more like something a designer would approve. That's not a benchmark you can measure with a number. It's a judgment call. And on judgment calls about visual proportion, Opus 4.7 still has an edge that GPT 5.5 hasn't fully closed.

But I want to flag something honest: the $1 cost difference here is rounding error for most projects. If you're building one-off prototypes, this isn't the experiment that should drive your model choice. The next one is.

Experiment 3: 3D Space Shooter — The Result That Changed My Mind

I went into this experiment expecting Opus 4.7 to win. Game logic, real-time controls, sound effects — this felt like territory where Opus's deeper reasoning would pay off.

It didn't.

The prompt: "Build a playable 3D space shooter with smooth player controls, enemy ships that fire back, particle effects on hits, sound effects, and a score system. Make it actually fun to play."

GPT 5.5 finished in roughly four minutes. The controls felt smooth. The physics were tight — projectiles arced naturally, enemy ships dodged in believable patterns, the camera tracked without jitter. I played it for about ten minutes before I remembered I was supposed to be testing it. It was actually fun. Estimated cost: under $3.

Opus 4.7 finished in roughly six minutes. The sound effects were better — more variety, better mixing, more dramatic explosion audio. But the controls felt clunky. Player movement had input lag. The shooting cooldown was mistuned. Enemy AI shipped with a bug where ships would freeze briefly when you moved behind them. Estimated cost: about $4.50.

I tested both versions twice to make sure I wasn't biased. Same result. GPT 5.5's game was better in the parts that matter most for a game — moment-to-moment feel — and Opus 4.7's was better in production polish that wouldn't matter if the gameplay was broken.

This is where I had to update my mental model of what these two models are good at. I used to file Opus 4.7 under "better at hard reasoning" and GPT 5.5 (well, Codex more broadly) under "better at fast iteration." That's not quite right anymore. GPT 5.5 is better at game-feel decisions because it makes more aggressive default choices about timing, response curves, and feedback loops. Opus 4.7 hedges more — it adds more configuration, more "would the user want this tunable?" — and the result is code that's more flexible but ships with worse defaults.

For game prototyping, that's a problem. Defaults matter. Players don't tune sliders before deciding if your game is fun.

If you've made it this far, you're already getting the version of this comparison that benchmark posts won't give you. The actual experience of using these models on real builds maps badly to the leaderboard. Which brings me to the last experiment, where both models humbled me.

Experiment 4: Ecosystem Simulation — Where Both Models Hit a Wall

The prompt: "Build an interactive ecosystem simulation with predators, prey, and plants. Include reproduction, hunger, aging, and death. The population should reach a stable equilibrium without manual intervention."

This is the one-shot prompt every AI coding model fails. I know it fails. I included it because watching how a model fails is more informative than watching it succeed.

GPT 5.5 ran for about ten minutes. It produced a working simulation with all the requested entities. The population dynamics were broken: predators died out within thirty seconds because they starved faster than they could reproduce. It used roughly 2x the input tokens of Opus 4.7, but significantly fewer output tokens. Net cost: slightly higher than Opus 4.7.

Opus 4.7 ran for about twelve minutes. Its simulation had the opposite failure: prey reproduced too fast and the screen was overwhelmed within forty seconds, then the framerate collapsed. Fewer input tokens, more output tokens. Slightly cheaper overall.

Neither got equilibrium. Neither was usable without follow-up iteration. But the way they failed told me something useful about each model.

GPT 5.5's failure was math-shaped. The simulation logic was structurally correct — it just needed parameter tuning. The hunger curve, the reproduction rate, the aging formula. Numbers I could adjust in five minutes.

Opus 4.7's failure was structural. The simulation had a logic error in how reproduction was triggered — every entity above a fitness threshold reproduced every tick, instead of every N ticks. To fix it, I'd need to refactor the loop. Twenty minutes minimum.
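To illustrate this class of bug, here is a minimal sketch comparing reproduction that fires every tick with reproduction gated by a cooldown. This is not the code Opus 4.7 generated; the class, the fitness threshold, and the cooldown value are all hypothetical, and the sketch leaves out hunger, aging, and death entirely.

```python
from dataclasses import dataclass

FITNESS_THRESHOLD = 0.5  # hypothetical value
REPRO_COOLDOWN = 10      # hypothetical: ticks an entity waits between births


@dataclass
class Entity:
    fitness: float
    ticks_since_repro: int = 0  # ticks since this entity last reproduced


def step_buggy(population: list[Entity]) -> list[Entity]:
    """Every fit entity reproduces on every tick, so the population doubles."""
    newborns = [Entity(e.fitness) for e in population if e.fitness > FITNESS_THRESHOLD]
    return population + newborns


def step_fixed(population: list[Entity]) -> list[Entity]:
    """Fit entities reproduce at most once every REPRO_COOLDOWN ticks."""
    newborns = []
    for e in population:
        e.ticks_since_repro += 1
        if e.fitness > FITNESS_THRESHOLD and e.ticks_since_repro >= REPRO_COOLDOWN:
            newborns.append(Entity(e.fitness))
            e.ticks_since_repro = 0
    return population + newborns


def simulate(step, ticks: int) -> int:
    """Run one fit entity through `ticks` steps and return the population size."""
    population = [Entity(fitness=0.9)]
    for _ in range(ticks):
        population = step(population)
    return len(population)


if __name__ == "__main__":
    print(simulate(step_buggy, 10), simulate(step_fixed, 10))
```

Over ten ticks, the every-tick version turns one entity into 1,024, while the gated version produces 2. The fix is small in this toy, but in a real simulation loop the cooldown state has to be threaded through every entity type, which is why it reads as a refactor and not a parameter tweak.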

This matches a pattern I've seen across the week. When GPT 5.5 fails, the failures tend to be tunable. When Opus 4.7 fails on complex one-shots, the failures tend to be architectural. That's not always true. But often enough that I now factor it into which model I reach for. Tunable failures are cheap to recover from. Architectural failures cost real time.

There's a third lesson buried in this experiment. Both models burned roughly $3-5 producing a broken simulation. If you're using these tools for actual research-grade complex tasks, you need to budget for two-to-three iteration cycles, not one. Anyone selling you "one-shot to production" is selling you the demo, not the workflow.

The Aggregate Numbers Across All Four Experiments

Here's the total damage across all four builds.

Metric	GPT 5.5	Opus 4.7
Total runtime	~20 min 49 sec	~40 min 43 sec
Total input tokens	~2.7 million	~2.5 million
Total output tokens	~70,000	~250,000
Total cost (estimated)	~$3 lower than Opus 4.7	Baseline

GPT 5.5 finished the same four builds in roughly half the wall-clock time. Used 3.5x fewer output tokens. Came in marginally cheaper despite the doubled price-per-token. That's the headline result, and it's real.
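As a sanity check, the rounded totals in the table can be plugged into the cost formula from the setup section. This is a back-of-the-envelope sketch: it assumes every token is billed at the list prices quoted earlier, with no caching or other discounts, so the absolute totals may differ from what a billing dashboard shows. The function name is illustrative.

```python
def estimated_cost(input_tokens: int, output_tokens: int,
                   input_price: float, output_price: float) -> float:
    """Cost in USD given per-million-token prices."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Rounded totals from the aggregate table; prices from the setup section.
gpt_total = estimated_cost(2_700_000, 70_000, 5.00, 30.00)
opus_total = estimated_cost(2_500_000, 250_000, 5.00, 25.00)

print(f"GPT 5.5:    ${gpt_total:.2f}")
print(f"Opus 4.7:   ${opus_total:.2f}")
print(f"Difference: ${opus_total - gpt_total:.2f}")
```

With these rounded inputs the gap works out to about $3.15, in line with the ~$3 in the table. The output side alone saves about $4.15, and the extra 200K input tokens give roughly $1 of that back.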

But here's the part the headline numbers hide. GPT 5.5's input token consumption was higher. If you're running a workload that's input-heavy — long context windows, big codebase loads, document analysis — that delta matters. Opus 4.7's 1M-token context window also dwarfs GPT 5.5's 400,000-token cap. For codebase-wide refactoring on a real production app, Opus 4.7 still has the larger working memory.

I covered the million-token context unlock in detail in my GPT 5.4 review from earlier this year — and most of what I said there about context-window discipline applies to GPT 5.5's 400K cap with even more force. You can't load an 800K-token Laravel monolith into GPT 5.5. You can into Opus 4.7. For some teams, that single fact ends the comparison.
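One way to check whether a repository is even a candidate for a given window is to estimate its token count before picking a model. The sketch below uses a rough 4-characters-per-token heuristic, which is an assumption and can be off substantially for code; a real tokenizer would be more accurate. The window sizes are the ones quoted above. The file extensions, the 80% headroom factor, and the helper names are hypothetical.

```python
from pathlib import Path

CONTEXT_WINDOWS = {"GPT 5.5": 400_000, "Opus 4.7": 1_000_000}  # tokens, as quoted above
CHARS_PER_TOKEN = 4  # rough heuristic, not a real tokenizer


def estimate_tokens(root: str, extensions: tuple[str, ...] = (".php", ".js", ".css")) -> int:
    """Approximate token count of all matching source files under root."""
    total_chars = 0
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in extensions:
            total_chars += len(path.read_text(encoding="utf-8", errors="ignore"))
    return total_chars // CHARS_PER_TOKEN


def models_that_fit(token_count: int, headroom: float = 0.8) -> list[str]:
    """Models whose window holds the codebase with room left for prompts and output."""
    return [name for name, window in CONTEXT_WINDOWS.items()
            if token_count <= window * headroom]


if __name__ == "__main__":
    print(models_that_fit(500_000))
```

The headroom factor exists because the window also has to hold the prompt, tool results, and the model's own output, so a codebase that technically fits may still be too large to work with comfortably.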

The Four Things GPT 5.5 Actually Improved

OpenAI is pitching four core advantages with the 5.5 release. After a week of testing, here's how they actually hold up.

Token efficiency. Real. Confirmed. Across four wildly different builds, GPT 5.5 used a fraction of the output tokens Opus 4.7 used for equivalent functional output. This is the single biggest economic improvement and it's not marketing — it shows up in your invoice.

Autonomous decomposition. Partially real. On vague prompts, GPT 5.5 makes more confident default choices and asks fewer clarifying questions. On the personal brand site experiment, this saved noticeable time. On the ecosystem simulation, it made wrong default choices that cost time. Net: useful, but trust it less on truly novel work.

Codex upgrades. Real. Multi-agent parallelism inside Codex moves visibly faster than the previous version. Reusable workflows are a genuine quality-of-life improvement if you've built up a personal library of patterns. The tool-calling is more reliable than what shipped with GPT 5.4.

Cybersecurity focus. I couldn't fully test this in a week. Anthropic and OpenAI both claim improved adversarial robustness in their flagship models. If you care about this, run your own red-team prompts. Don't trust either marketing department on it. I covered some of the broader security-and-AI tension in the AI zero-day discovery debate post earlier this year — it's worth a read if model security matters to your stack.

The Honest Trade-offs Nobody Is Posting

Time for the part where I tell you what I'd say to a friend over coffee.

GPT 5.5's price doubling matters more than the marketing wants you to believe. Yes, the output token efficiency makes it competitive on a per-task basis for some workloads. But if you're calling the API directly, without Codex's smarter context management, you can absolutely run up bigger bills than you did on GPT 5.4. The "more with less" claim has caveats. The biggest caveat: it depends on the harness doing its job.

Opus 4.7's million-token context is still a moat. For anyone working in large codebases — not toy projects, real production systems with hundreds of files — Opus 4.7's context window changes what you can ask. I wrote about this extensively when I tested Opus 4.7's first builds against my own workflow and the conclusion still holds. Context size is a capability, not a feature. GPT 5.5's 400K is plenty for most tasks. For the tasks where 400K isn't enough, you need Opus 4.7. There's no workaround.

The release cadence is exhausting. Six weeks between major model releases means the workflow you optimized last month might not be optimal this month. I don't have a fix for this. I've started defaulting to "test the new model on three real tasks I've already done with the previous model, then make a decision." Anything more elaborate burns hours I don't have. If you're trying to keep up with every release in real time, you'll burn out before you ship anything meaningful.

The benchmarks are useful but limited. GPT 5.5 leads on Terminal Bench 2.0 by 13 points over Opus 4.7. It also leads on Frontier Math and Cyber Gym. Those are real signals. But "leads on benchmark X" doesn't translate to "is better for your workflow." My space shooter experiment is the cleanest example — GPT 5.5 won the playability test in a way that no benchmark would have predicted.

If you want a deeper look at the broader Codex ecosystem and how it stacks up against the Claude Code workflow, I went through the two-agent workflow comparison and the Codex-vs-Claude-Code subscription math in earlier posts. The conclusions in both still apply with GPT 5.5 — they just got slightly more interesting.

What To Use When: The Decision Tree I'm Actually Following

After this week of testing, here's the rough framework I'm running my own work through.

Use GPT 5.5 (via Codex) when:

You need to ship fast and the project fits in 400K tokens of context
Game-feel, micro-interaction, or moment-to-moment feedback matters
You want autonomous behavior on vague prompts
You're billing by hours saved, not by tokens consumed
The task benefits from reusable workflow patterns

Use Opus 4.7 (via Claude Code) when:

The codebase is too large for GPT 5.5's context window
Visual proportion, design taste, or typographic care matters
You'd rather have flexible, configurable code than opinionated defaults
You need the model to explain its reasoning extensively
You're operating on long-running agent sessions where context retention matters

Use both in parallel when:

The task is genuinely complex and you want two competing solutions to compare
You're not sure which model will produce the better default and you can afford the bill
You're building agent workflows that route different sub-tasks to different models

That last bullet is where I think the next year of AI coding actually goes. Not "which model is best." Routing logic that picks the right model for each sub-task within a larger workflow. I've been experimenting with this and it's promising but early. Topic for another post.
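As a starting point, the decision tree above could be encoded as a small routing function. This is a hypothetical sketch of those rules of thumb, not a tested production router; the `Task` fields, the model labels, and the order of the checks are all assumptions made for illustration.

```python
from dataclasses import dataclass

GPT_CONTEXT_LIMIT = 400_000  # tokens, as quoted in this post


@dataclass
class Task:
    context_tokens: int            # estimated tokens the task must load
    design_heavy: bool = False     # visual proportion, typography, design taste
    needs_game_feel: bool = False  # controls, timing, moment-to-moment feedback
    compare_both: bool = False     # complex enough to justify two competing solutions


def route(task: Task) -> list[str]:
    """Return the model(s) to run, loosely following the decision tree above."""
    if task.context_tokens > GPT_CONTEXT_LIMIT:
        return ["Opus 4.7"]             # too large for GPT 5.5's window
    if task.compare_both:
        return ["GPT 5.5", "Opus 4.7"]  # run both and compare the outputs
    if task.design_heavy and not task.needs_game_feel:
        return ["Opus 4.7"]
    return ["GPT 5.5"]                  # default for fast, in-window work


if __name__ == "__main__":
    print(route(Task(context_tokens=800_000)))
    print(route(Task(context_tokens=50_000, needs_game_feel=True)))
```

The context check comes first on purpose: it is the only hard constraint in the list, while everything after it is a preference that can be reordered to taste.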

GPT 5.5 vs Opus 4.7: Common Questions

Is GPT 5.5 better than Opus 4.7 for coding?

GPT 5.5 is faster, more token-efficient on output, and came out slightly cheaper in aggregate across my four builds, though it cost slightly more on the two builds where its input usage ran high. Opus 4.7 has a larger context window (1M vs 400K tokens) and produces more design-aware output. For projects that fit in 400K tokens, GPT 5.5 has the edge. For large codebases, Opus 4.7 still wins.

How much does GPT 5.5 cost compared to GPT 5.4?

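GPT 5.5 doubled GPT 5.4's pricing: $5/M input and $30/M output, up from $2.50/M input and $15/M output. The pitch is that fewer output tokens per task offsets the unit price. In my testing, this held in aggregate and on two of the four builds; on the solar system and ecosystem builds, higher input usage made GPT 5.5 slightly more expensive.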

What is the context window for GPT 5.5?

GPT 5.5 supports a 400,000-token context window. Opus 4.7 supports 1 million tokens. For most coding tasks, 400K is sufficient. For codebase-wide refactoring on large production systems, Opus 4.7's larger context is still the better fit.

Can GPT 5.5 replace Claude Code with Opus 4.7?

For some workflows, yes — particularly fast prototyping, game development, and tasks where game-feel matters. For long-running agent sessions, large codebases, or design-heavy work, Opus 4.7 inside Claude Code still has advantages. I run both in parallel and route tasks based on the decision tree above.

Does GPT 5.5's autonomous decomposition actually work?

It's real but uneven. On vague prompts for well-trodden problems (websites, common simulations), it makes confident default choices that save time. On genuinely novel work like ecosystem simulations, it makes wrong defaults that cost time. Trust it more on familiar territory, less on novel territory.

What I'm Doing Next Monday Morning

Here's the practical answer.

I'm leaving Opus 4.7 as my default for the agent that does cross-codebase refactoring on a large Laravel project I maintain. The 1M-token context is doing work nothing else can do, and that's not changing this week.

I'm switching my prototyping workflow to GPT 5.5 through Codex. The 3.5x speed improvement on the personal brand site experiment alone justifies it for client work where speed matters more than visual polish. For the polish pass, I'll bring Opus 4.7 back in.

I'm setting up an A/B test for the next two weeks where I run every new prompt through both models in parallel and track which output I actually ship. I'll have real data instead of vibes by mid-May. If you want the follow-up post, you'll find it on mejba.me when I publish it.

The thing I didn't expect to feel after this week of testing: relief. Not because one model decisively won. Because both models are now good enough that the choice between them is a workflow optimization, not a quality compromise. We're past the era where you had to pick the best model and live with its weaknesses. Now you pick the right model for the task. That's a much better problem to have.

The real question isn't "GPT 5.5 vs Opus 4.7." It's "which one should you reach for at 9 AM tomorrow when you have a thing to build by 5 PM?" Answer that for yourself based on the four-experiment framework above and you'll save more money than any benchmark sheet will tell you.

Now go run your own four experiments. Don't trust mine. Don't trust anyone's. Run yours.

Which Model Actually Wins for You

After four real builds, the honest answer is that GPT 5.5 and Opus 4.7 win different jobs: one on token efficiency and cost, the other on polish and context size. The verdict nobody says out loud is that "best" depends entirely on which of those you're optimizing for.

Run both on your own hardest build, not a benchmark, and let the token bill and the output quality decide. That's the only comparison that pays your invoices.

If you want help choosing and routing models for cost and quality, that's something I do through Ramlit. My head-to-head results are above.
