Sakana Fugu Ultra: I Watched It Beat Stockfish

The detail that stopped me wasn't a benchmark. It was a chess game played without a board.

No image of the pieces. No coordinate grid. Just a model holding the entire game state in its head, move after move, against a Stockfish engine rated around 2,100 Elo, the kind of strength that beats almost every human club player you'll ever meet. And Sakana Fugu Ultra didn't just survive. It won four games in a row, every one ending in checkmate, against three frontier models and the engine.

That's the moment I realized I'd been thinking about this whole thing wrong. I came into the video on Sakana Fugu Ultra expecting another "new model beats GPT" hype reel, the kind I've learned to discount on sight. What I got instead was a quietly radical argument: maybe the next jump in AI capability doesn't come from a bigger brain. Maybe it comes from a smarter committee.

I want to be upfront about what this post is, because the honesty matters more than the click. I haven't run Fugu's API myself — access is gated, and at launch it's blocked in the EU/EEA while Sakana works through GDPR. So I'm not going to pretend I deployed it on a client project last Tuesday. What I can do is something more useful: take the test results that exist, cross-check them against Sakana's published research, and tell you what they actually mean for how you pick tools in 2026. The numbers below come from the source tests and Sakana's own materials. The judgment is mine.

Let me show you why "is it better than GPT-5.5?" turns out to be the wrong question entirely.

What is Sakana Fugu Ultra, and why isn't it a normal model?

Sakana Fugu Ultra is not a foundation model — it's a multi-agent orchestration system that decomposes a task, routes the subtasks to different specialized LLMs, then critiques, verifies, and synthesizes their outputs into one answer. When you call it through its single API endpoint, you're not talking to one set of weights the way you do with Opus 4.8 or GPT-5.5. You're talking to a conductor that knows which musicians to call for which passage.

Sakana AI is a Tokyo research lab, and Fugu launched on June 22, 2026. The "model" label undersells it. Here's the architecture in plain terms: Fugu takes your prompt, breaks it into pieces, and assigns roles across a swappable pool of frontier models — think Thinker, Worker, Verifier. One model drafts a plan. Another executes a piece of it. A third checks the work. Fugu stitches the results together and hands you the finished output.
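To make that Thinker/Worker/Verifier flow concrete, here's a minimal Python sketch. It is not Sakana's code or API: the three role functions are hypothetical stand-ins I made up for calls to real models, and the routing here is hardcoded purely for illustration.

```python
from typing import List

# Hypothetical stand-ins for three different models in the pool.
# A real system would make API calls here; these names are illustrative only.
def thinker(prompt: str) -> List[str]:
    """Draft a plan: split the prompt into subtasks (one per ' and ')."""
    return [part.strip() for part in prompt.split(" and ") if part.strip()]

def worker(subtask: str) -> str:
    """Execute a single subtask and return its result."""
    return f"done: {subtask}"

def verifier(subtask: str, result: str) -> bool:
    """Check that the result actually addresses the subtask."""
    return subtask in result

def orchestrate(prompt: str, max_retries: int = 1) -> str:
    """Decompose, delegate, verify, then stitch the pieces together."""
    outputs = []
    for subtask in thinker(prompt):
        result = worker(subtask)
        retries = 0
        # Re-run the worker if the verifier rejects its output.
        while not verifier(subtask, result) and retries < max_retries:
            result = worker(subtask)
            retries += 1
        if not verifier(subtask, result):
            raise RuntimeError(f"verification failed for: {subtask}")
        outputs.append(result)
    return "\n".join(outputs)

if __name__ == "__main__":
    print(orchestrate("build the front-end and build the back-end"))
```

Even in this toy version you can see where the extra model calls come from: every subtask costs at least one worker call plus one verifier call, on top of the planning call.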

The word "swappable" is doing heavy lifting there. Because Fugu routes to models rather than being one, the pool can grow as new frontier models ship — no retraining required. That's a genuinely different bet on where AI value comes from. Most labs are racing to build the single smartest brain. Sakana is betting that coordinating the brains we already have is the cheaper, faster path to more of the wins.

Here's the part most coverage gets wrong, and it changes everything: Fugu's orchestration is learned, not hardcoded. This isn't a router built from if/else logic and a keyword matcher. According to Sakana's research, Fugu is itself a trained language model whose job is to call other LLMs, and its coordination approach builds on two ICLR 2026 papers: Trinity (an evolved coordinator that assigns those Thinker/Worker/Verifier roles) and The Conductor (trained with reinforcement learning to discover natural-language coordination strategies). The system learned what to say to each model to make a diverse pool outperform any single worker.

And there's a wild detail buried in there. Fugu can call itself recursively — reading its own prior output, deciding whether its first coordination attempt fell short, and spinning up a corrective workflow. The depth of that recursion becomes a tunable compute axis at inference time. You can spend more thinking by going deeper, without retraining anything. That's a new flavor of test-time scaling, and it's the kind of idea that's obvious in hindsight and almost nobody shipped first.
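Here's how I picture recursion depth as a compute knob, as a minimal Python sketch. Everything in it is my own assumption: `refine`, `toy_attempt`, and `toy_check` are hypothetical names, and the "quality check" is a toy. It shows only the control flow, not how Fugu decides that an attempt fell short.

```python
from typing import Callable, Tuple

def refine(task: str,
           attempt: Callable[[str, str], str],
           good_enough: Callable[[str], bool],
           max_depth: int,
           prior: str = "",
           depth: int = 0) -> Tuple[str, int]:
    """Re-run a workflow on its own prior output until it passes or the
    depth budget runs out. Returns (answer, depth_used)."""
    answer = attempt(task, prior)
    if good_enough(answer) or depth >= max_depth:
        return answer, depth
    # The previous answer becomes input to the corrective pass.
    return refine(task, attempt, good_enough, max_depth,
                  prior=answer, depth=depth + 1)

# Toy stand-ins: each pass improves the draft by one "+",
# and three of them count as good enough.
def toy_attempt(task: str, prior: str) -> str:
    return (prior or task) + "+"

def toy_check(answer: str) -> bool:
    return answer.count("+") >= 3

if __name__ == "__main__":
    print(refine("draft", toy_attempt, toy_check, max_depth=1))
    print(refine("draft", toy_attempt, toy_check, max_depth=5))
```

With a small `max_depth` the loop stops early with a weaker answer; with a larger one it keeps going until the check passes. That's the whole trade: more passes, more spend, better odds.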

So when you see Fugu "beat" a frontier model on a benchmark, hold that result up to the light. Of course a system that decomposes, delegates, and verifies does well on tasks that reward careful problem-solving. That's literally what it's built to do. The interesting question isn't whether it wins — it's where it wins, and what it costs you to get there.

That cost question is where the story gets uncomfortable.

The trader-desk test: where the money actually goes

I want to start with the least dramatic test, because it's the most honest one. The brief: build a "live trader desk" — a front-end plus back-end, the kind of multi-component app real people actually ship. Four systems got the same prompt. Here's what they used, as reported in the source:

System	Tokens Used	Cost (USD)	What you got
Fugu Ultra	~22,000	$0.51	Most polished, feature-rich UI — and the priciest
Opus 4.8	~16,000	$0.31	Solid, balanced implementation
GPT-5.5	~11,000	$0.26	Good quality-to-efficiency ratio
Chinchilla 5.2	~13,000	$0.03	Cheapest by far, least design polish

Read that table slowly, because there are two stories in it.

The first story is the one Sakana wants you to see: Fugu produced the best-looking, most complete UI. If "make it impressive in one shot" is the job, Fugu delivered. The orchestration paid off in polish — multiple models cross-checking each other tends to catch the gaps a single pass leaves behind.

The second story is the one that matters for your budget. Fugu cost $0.51 — about 17x what Chinchilla 5.2 charged for a working version of the same thing. It burned the most tokens, too. That's not a bug. That's the architecture. Every time Fugu decomposes a task, routes it, and verifies the result, it's making more model calls than a single model would. Coordination has overhead, and you pay for it in tokens, dollars, and latency.

Here's where I land, and it's not where the marketing wants me to: for a straightforward build, that premium is hard to justify. Chinchilla 5.2 gave you a functional trader desk for three cents. If you need it pretty, Opus 4.8 split the difference at $0.31 with a clean result. Fugu's extra 48 cents over Chinchilla buys you polish, and on a lot of internal tools, nobody's grading the polish.
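If you want to check the premium yourself, the arithmetic takes a few lines of Python. The dollar figures are the reported ones from the table above; the `premium` helper is just my own illustration.

```python
# Reported trader-desk costs in USD, copied from the table above.
costs = {
    "Fugu Ultra": 0.51,
    "Opus 4.8": 0.31,
    "GPT-5.5": 0.26,
    "Chinchilla 5.2": 0.03,
}

def premium(system: str, baseline: str) -> tuple:
    """Return (cost ratio, extra dollars) of `system` over `baseline`."""
    ratio = costs[system] / costs[baseline]
    extra = costs[system] - costs[baseline]
    return ratio, extra

if __name__ == "__main__":
    for baseline in ("Opus 4.8", "GPT-5.5", "Chinchilla 5.2"):
        ratio, extra = premium("Fugu Ultra", baseline)
        print(f"Fugu vs {baseline}: {ratio:.1f}x, +${extra:.2f}")
```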

But "a lot of internal tools" isn't every job. The trader-desk test rewards efficiency, so the efficient tools look smart. Change the task to one that rewards coordination, and the picture flips hard.

The Crossy Road test: when faster and cheaper produces worse

This is the test that reframed the whole thing for me, and it has nothing to do with which system is "smarter."

The task: build a 3D Crossy-Road-style game. Same brief, head to head — Fugu Ultra against Opus 4.8. Here are the reported figures, and I'm presenting them exactly as the source reported them, not as numbers I verified myself:

Dimension	Fugu Ultra	Opus 4.8
Time to build	~22 minutes	~79 minutes
Tokens used	~90,000	~1,000,000
Cost	~$7.32	~$37
Result	Faster, cheaper, but flawed	Slower, pricier, more polished

Fugu was roughly 3.6x faster, used about 11x fewer tokens, and cost about 5x less. Stop and sit with that, because it cuts against the trader-desk result you just read. Here, the orchestrated system was the frugal one.
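Those multiples fall straight out of the reported figures. A quick Python check, using the approximate values from the table (so treat the results as approximate too):

```python
# Reported Crossy Road figures (approximate, as given in the table above).
fugu = {"minutes": 22, "tokens": 90_000, "cost_usd": 7.32}
opus = {"minutes": 79, "tokens": 1_000_000, "cost_usd": 37.0}

# How many times more Opus 4.8 spent than Fugu Ultra on each dimension.
ratios = {key: opus[key] / fugu[key] for key in fugu}

if __name__ == "__main__":
    for key, value in ratios.items():
        print(f"{key}: Opus used {value:.1f}x what Fugu did")
```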

And yet it produced the worse game. Fugu's Crossy Road clone had inverted turning controls: push right, go left. The camera fought the player. There was no sound. The game was incomplete. Opus 4.8 spent five times the money and more than three and a half times the wall-clock time, and gave back something more polished and more functional, though still slightly buggy.

So who won? That's the wrong question, and that's the entire point. If you're prototyping fifty game concepts to find the one worth building, Fugu's profile is obviously correct — you want speed and cheapness, and you'll fix the camera on the one idea that survives. If you're shipping the game players will actually pay for, Opus 4.8's polish is worth every extra dollar.

Notice what just happened across two tests. On the trader desk, Fugu was the expensive option. On Crossy Road, Fugu was the cheap option. Same system. The variable wasn't Fugu — it was the task. Orchestration overhead is a fixed tax that pays off enormously on some jobs and bleeds you dry on others, and you cannot know which without matching the task to the architecture.

That's the skill nobody's teaching yet: reading a task and predicting which shape of system fits it. Let me give you the rule of thumb I'd use.

Should you use Fugu Ultra or just pick a frontier model?

Use Fugu Ultra when the task is multi-component, high-detail, and benefits from verification — UI builds, simulations, anything where cross-checking catches errors a single pass misses. Reach for a single frontier model like Opus 4.8 or GPT-5.5 when you need predictable speed, low cost, and a tight feedback loop. The deciding factor isn't capability. It's whether decomposition-and-verification earns back its overhead on this specific job.

Here's the decision I'd actually run through, in order:

1. Is this a one-shot impressive artifact or a tight iteration loop? One-shot polish favors Fugu's verify-and-synthesize loop. Fast iteration favors a single model, because you don't want orchestration latency between every round of feedback.
2. How long-horizon is the task? This one's important. The reported results show Fugu sometimes lags on broad, long-horizon work (things like Sweep Bench Pro) precisely because orchestration overhead and coordination failure points compound over many steps. More moving parts means more places to break.
3. What's your cost ceiling, and your polish floor? If you have a hard budget and a forgiving quality bar, a single efficient model wins almost every time. If polish is non-negotiable and budget is flexible, Fugu's extra calls earn their keep.
4. Do you need it to run in the EU? At launch, Fugu is unavailable in the EU/EEA while Sakana works through GDPR. If your stack or users live there, the decision is made for you.
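For the engineers who'd rather read that checklist as code, here's a rough Python sketch. The `Task` fields and `pick_system` are names I invented for illustration, and I've moved the EU check to the front because it's a hard constraint. It's a rule of thumb written down, not a tested policy.

```python
from dataclasses import dataclass

@dataclass
class Task:
    one_shot_artifact: bool   # impressive single deliverable vs. tight iteration
    long_horizon: bool        # many dependent steps where overhead compounds
    hard_budget: bool         # strict cost ceiling
    polish_required: bool     # high quality floor
    must_run_in_eu: bool      # users or stack in the EU/EEA

def pick_system(task: Task) -> str:
    """Return 'orchestrator' or 'single model' for a given task profile."""
    if task.must_run_in_eu:
        return "single model"   # Fugu is unavailable in the EU/EEA at launch
    if task.long_horizon:
        return "single model"   # coordination failure points compound
    if task.hard_budget and not task.polish_required:
        return "single model"   # the efficient model wins on cost
    if task.one_shot_artifact and task.polish_required:
        return "orchestrator"   # verify-and-synthesize earns its overhead
    return "single model"       # default to the cheaper, faster loop

if __name__ == "__main__":
    polished_ui = Task(True, False, False, True, False)
    internal_tool = Task(False, False, True, False, False)
    print(pick_system(polished_ui))
    print(pick_system(internal_tool))
```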

On raw benchmarks, the source reports Fugu scoring well in engineering, scientific reasoning, coding, and agentic tasks — and often outperforming Mythos 5 on specific benchmarks like Live Code Bench and BBQ Evil, exactly the kind that reward careful decomposition and verification. But it falls short of true frontier models like Fable 5 on messier, real-world tasks. The benchmark wins are real and they're partly an artifact of what orchestration is built to be good at. Both things are true.

One more honest caveat I won't bury: most of the headline benchmark claims are Sakana's own numbers. Self-reported benchmarks from the company selling the product are marketing until independent evaluators reproduce them. I'm not saying they're wrong — I'm saying the burden of proof sits with Sakana, and right now it's only partly met. The third-party test results above (trader desk, Crossy Road) are more trustworthy precisely because they weren't run by Sakana.

If the whole multi-model, ensemble direction interests you, I traced the early version of this pattern in my breakdown of open-source AI ensembles, and I covered Fugu's launch in context alongside the other June releases in my AI model roundup for June 2026. This post is the deep dive on Fugu alone; that roundup is the wider map.

Now — the tests where Fugu genuinely impressed me, and where the orchestration architecture stops being a tradeoff and starts being an advantage.

Where orchestration actually shines: sims, terrain, and a board it can't see

Three results moved me from skeptic to "okay, this is real."

The black hole simulation. The brief was a surrealistic black hole sim — codename "Singularity." Fugu produced a detailed, well-rendered visualization that out-rendered GLM MiniMax and Chinchilla 2.7 Code on visual accuracy. This is exactly the kind of task orchestration should win: rendering a physically-flavored scene correctly involves several sub-problems — the geometry, the lighting, the distortion physics, the surreal styling — and a system that can route each to a capable model and verify the composite has a structural edge over a single model trying to hold all of it at once.

The flight simulator. Same story, different domain. Fugu generated a semi-accurate infinite-terrain flight sim that surpassed GLM 5.2 and MiniMax, both of which returned limited results. "Infinite terrain" is a decomposition problem in disguise — terrain generation, the flight physics, the camera, the render loop — and decomposition is Fugu's home turf.

And then the chess. I keep coming back to this one because it's the cleanest demonstration of what "maintaining state through coordination" actually buys you. Blindfold chess, one-shot, no visual board: the system has to track the entire position in working memory across the whole game. Fugu won four consecutive games against three frontier models and a Stockfish engine around 2,100 Elo, ending every game in checkmate. It held game state and move accuracy better than opponents that, on paper, are more capable.

Why does that happen? Because a verifier in the loop catches the blunder before it's committed. A single model playing blindfold chess has one shot to track the board correctly each move. An orchestrated system can have one component propose a move and another sanity-check the resulting position against the move history. That's not magic — it's the same decompose-and-verify loop, applied to a problem where a single slip loses the game. The architecture's whole reason for existing is to catch the mistake the soloist would make.
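Here's that propose-then-verify idea as a toy Python sketch. To keep it self-contained I've swapped chess for a nine-square board, where the only rule is "don't play on an occupied square." The function names are hypothetical, and I'm not claiming this is how Fugu checks moves. The point is the shape: the verifier rebuilds the position from the move history alone, so a proposer that misremembers the board gets caught before the move is committed.

```python
from typing import Iterable, List, Optional, Set

def replay(history: List[int]) -> Set[int]:
    """Rebuild the occupied squares (0-8) from the move history alone."""
    occupied: Set[int] = set()
    for square in history:
        if square in occupied:
            raise ValueError(f"history is inconsistent at square {square}")
        occupied.add(square)
    return occupied

def verify(history: List[int], move: int) -> bool:
    """Accept a move only if it is on the board and the square is free."""
    return 0 <= move <= 8 and move not in replay(history)

def choose_move(history: List[int], candidates: Iterable[int]) -> Optional[int]:
    """Return the first proposed move the verifier accepts, or None."""
    for move in candidates:
        if verify(history, move):
            return move
    return None

if __name__ == "__main__":
    history = [4, 0, 8]
    # The proposer "forgets" square 4 is taken and suggests it first.
    print(choose_move(history, [4, 2]))
```

A soloist would have played the first candidate and lost the thread. The checked version falls through to the next legal option.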

If you've read this far, here's the shift I want you to take with you: for years we asked "which model is smartest?" The more useful 2026 question is "which shape of system fits this job?" And "an orchestrator routing across many models" is now a real, shipping answer to that question — not a research curiosity.

What I got wrong about where the next jump comes from

Time for real talk, because a tool review that only lists features is a spec sheet, and you can get that anywhere.

First, I was wrong about the shape of progress. I assumed the next capability jump would come from a bigger single model — more parameters, more training, a fatter brain. Fugu's results suggest a meaningful chunk of near-term progress will come from coordination instead: squeezing more out of the models we already have by routing intelligently between them and verifying the output. That's a humbler, less glamorous form of progress. It doesn't make a flashy "new model" headline. I think it's been underrated for exactly that reason.

Second, the cost axis is now as important as the capability axis, and most coverage still ignores it. Everyone benchmarks intelligence. Almost nobody benchmarks dollars-per-finished-task. The trader-desk and Crossy Road tables are the clearest illustration I've seen that "best" is a budget-dependent word now. When I advise teams, the first question is no longer "which model is smartest" — it's "what's your tolerance for cost versus polish on this job." Most days I'll take the cheaper result and fix the camera myself.

Third — and this is the limitation Sakana won't lead with — orchestration overhead is a real, recurring tax. More model calls mean higher latency, higher cost, and more failure points. Every hop between models is a place the workflow can drop context or misroute. On long-horizon tasks, those failure points compound, which is exactly why Fugu lags on the broadest benchmarks. An orchestrator is only as reliable as its weakest handoff, and it has more handoffs than a single model has. That's not a flaw to be patched away — it's the inherent cost of the design.
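A back-of-envelope way to see the compounding, in Python. The big assumption, and it is an assumption, is that each handoff succeeds independently with the same probability. The 99% figure is purely illustrative, not a measured number for Fugu or anyone else.

```python
def chain_success(per_handoff: float, handoffs: int) -> float:
    """Probability that every handoff in a chain succeeds, assuming each one
    succeeds independently with probability `per_handoff`."""
    if not 0.0 <= per_handoff <= 1.0:
        raise ValueError("per_handoff must be a probability between 0 and 1")
    return per_handoff ** handoffs

if __name__ == "__main__":
    # Illustrative only: a 99%-reliable handoff repeated along a longer chain.
    for n in (1, 10, 100):
        print(f"{n} handoffs: {chain_success(0.99, n):.3f}")
```

Under those assumptions, ten handoffs at 99% each leave you around 90% end to end, and a hundred leave you under 40%. Real failures aren't independent, so don't read this as a forecast, but it shows why long chains are where orchestrators hurt.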

If the orchestration pattern has you curious about running one yourself, I've put a couple of these systems through their paces — see my hands-on with the OpenAI Symphony agent orchestrator, which tackles the same coordinate-many-models problem from the coding-harness angle. And if you're weighing whether to wire a multi-agent orchestration layer into your own stack — figuring out where it earns its overhead versus where a single model is the saner call — that's exactly the kind of architecture decision I take on through my Fiverr. The honest answer is usually "use orchestration for the 20% of tasks that genuinely need it, and a fast single model for the rest," and getting that split right is most of the value.

So where does Fugu actually fit? Let me make that concrete.

What to expect if you adopt Fugu Ultra today

I won't invent precision I don't have. But the reported tests, read against the architecture, point to a consistent shape you can plan around.

Expect Fugu to win on multi-component, high-detail, one-shot artifacts — the polished UI build, the rendered simulation, the multi-part generation where verification catches what a single pass misses. The trader-desk UI, the black hole sim, the flight sim, the blindfold chess all share that DNA: several sub-problems that benefit from being split, solved, and checked.

Expect Fugu to lag on long-horizon, open-ended, or cost-sensitive work — broad agentic tasks where overhead compounds, and any job where a cheaper single model gets you 90% of the way for a fraction of the spend. Chinchilla 5.2's three-cent trader desk is the cautionary tale: if you don't need the polish, you're paying a steep premium for it.

Expect to pay more and wait longer than you would with GPT-5.5 or Opus 4.8 on equivalent tasks, as a rule, with the reported Crossy Road build as the notable exception. That's the structural cost of coordination, and it won't fully disappear, though Sakana's recursive-depth idea suggests they at least have a knob for trading compute against quality deliberately rather than blindly.

And expect this to improve. Fugu launched June 22, 2026; it's early. The pool is swappable, so it inherits every new frontier model for free. The coordination is learned, so continued training can sharpen it. The proof-of-concept is already convincing. The question is whether Sakana can close the overhead gap fast enough to make orchestration the default rather than the specialist choice.

For now, my recommendation is unglamorous and, I think, correct: Fugu Ultra is a specialist tool, not an everyday driver. For general application work, GPT-5.5 and Opus 4.8 currently give you a better cost-speed-quality balance. Keep Fugu in your kit for the specific high-detail, multi-component jobs where decompose-and-verify earns its keep — and watch the overhead trend, because if it drops, the whole calculus changes.

Come back to that blindfold chess game one more time. A system that couldn't see the board still won — not because it was the smartest player at the table, but because it had a teammate checking its work before every move. That's the real lesson of Fugu Ultra, and it's bigger than one product. The next era of AI might not be won by the smartest model. It might be won by the best-coordinated team of ordinary ones.

So the question I'd leave you with isn't "is Fugu better than GPT-5.5?" It's this: of the jobs on your plate this week, which ones are you solving with a soloist that actually need a committee?

Is Sakana Fugu Ultra a foundation model or an orchestrator?

Fugu Ultra is an orchestrator, not a foundation model. It decomposes a task, routes subtasks to a swappable pool of frontier LLMs, then verifies and synthesizes their outputs through a single API. Unlike Opus 4.8 or GPT-5.5, it doesn't answer directly from one set of weights; according to Sakana's research, it is a trained model whose job is to coordinate other models. See the architecture breakdown above for the full picture.

Is Fugu Ultra cheaper than Opus 4.8 or GPT-5.5?

It depends entirely on the task. On a Crossy Road build, Fugu reportedly cost about 5x less than Opus 4.8; on a trader-desk build, it was the most expensive of four systems at $0.51. Orchestration overhead is a fixed tax that pays off on some jobs and bleeds you on others. The decision framework above explains how to predict which.

What benchmarks does Fugu Ultra do well on?

Fugu reportedly scores well on engineering, scientific reasoning, coding, and agentic benchmarks, and often outperforms Mythos 5 on tasks like Live Code Bench and BBQ Evil that reward decomposition and verification. It tends to lag on long-horizon benchmarks like Sweep Bench Pro, where orchestration overhead compounds.

Where is Sakana Fugu Ultra available?

Fugu Ultra is accessible through an API provider and launched on June 22, 2026. At launch it is unavailable in the EU/EEA while Sakana AI works through GDPR compliance. If your users or stack live in Europe, that restriction may decide the question for you.

Did Fugu Ultra really beat Stockfish at blindfold chess?

According to the source tests, yes. Fugu won four consecutive blindfold games (no visual board) against three frontier models and a Stockfish engine rated around 2,100 Elo, ending every game in checkmate. The likely reason is its verify-in-the-loop design, which catches the position-tracking blunder a single model would commit.

Let's Work Together

Looking to build AI systems, automate workflows, or scale your tech infrastructure? I'd love to help.

Fiverr (custom builds & integrations): fiverr.com/s/EgxYmWD
Portfolio: mejba.me
Ramlit Limited (enterprise solutions): ramlit.com
ColorPark (design & branding): colorpark.io
xCyberSecurity (security services): xcybersecurity.io
