Xiaomi MiMo 2.5 Pro Tested: Open Source Frontier?

I read "Xiaomi" in the headline and almost kept scrolling.

That sounds dismissive. It is, a little. But here's the honest truth about how I'd been triaging open-weight model launches by April 2026 — there were too many. DeepSeek shipped V4 in February. Kimi K 2.6 dropped right after. GLM 5 Pony went up. MiniMax M2.7 followed. Qwen released four variants in a single month. Every one of them came with a launch video full of dock animations and Minecraft clones, and every one of them peaked somewhere south of Opus 4.6 on the workloads I actually run for clients. So when the MarkTechPost notification hit on April 22 — Xiaomi releases MiMo-V2.5 and V2.5-Pro — my first reaction was to file it under "I'll skim the benchmark chart later."

Then I saw the price. $1 per million input tokens. $3 per million output tokens. MIT license. 1.02 trillion total parameters. 42B active. 1M token context. And a SWE-bench Pro score of 57.2 — beating Claude Opus 4.6's 53.4 on the same harness.

That's not a launch I get to file under "later." That's a launch where I close my agent runs and start testing.

I spent the next five days putting Xiaomi MiMo 2.5 Pro through everything I could think of — agentic loops with hundreds of tool calls, the absurd front-end demos the launch video was bragging about, multi-file refactors on a real Laravel codebase, 3D simulations in Three.js, and a few of the workloads where Opus 4.7 had been making me feel good about paying $15 input / $75 output. Some of it was a genuine surprise. Some of it confirmed exactly what I expected. And one specific finding changed which model I'm reaching for first on a workload I run dozens of times per week — but probably not the workload you'd guess from the launch video.

Here's what complicates the easy "China-ships-frontier-model-for-cheap" narrative: the benchmark wins are real, the token efficiency is unreasonable, and the failure modes are weird and worth knowing about before you wire this into a production agent stack. I'll resolve all of that before you scroll out of this post.

Why Xiaomi Shipping a Frontier Model Is the Strange Part

We're four months deep into what's going to get called the open-source surge of 2026 — the stretch where the gap between hosted American flagships and downloadable Chinese models stopped being a gap and started being a hairline crack. I covered the DeepSeek V4 Pro release in February and called it the first genuinely competitive open-weights model on agentic coding. That post is still accurate. DeepSeek V4 was the first. MiMo 2.5 Pro is the second — but it's the one that should make Anthropic's pricing team nervous.

The strange part isn't the capability. The strange part is the company shipping it.

Xiaomi makes phones. They make rice cookers and air purifiers. They have an automotive division that ships actual SUVs. They aren't an AI lab, they're a hardware conglomerate. And on April 22, 2026, they open-sourced a 1.02T parameter Mixture-of-Experts model that beats Claude Opus 4.6 on SWE-bench Pro and matches GPT-5.4 on long-horizon agentic benchmarks. Under the MIT license. With commercial use explicitly permitted. With the model weights live on Hugging Face the same day as the announcement.

That's not how AI labs ship. That's how a hardware company ships when they've decided the market needs to be reset.

The pitch on the official Xiaomi MiMo page is straightforward: hybrid attention architecture, 1,048,576 token context window, 131,072 token max output, optimized for agentic workflows that span thousands of tool calls. Pricing through OpenRouter sits at $1 input / $3 output per million tokens — a fifteenth of Opus 4.7's input rate and a twenty-fifth of its output rate. Free access through Kilo Code's $25 credit pool, OpenRouter's standard API, and a chatbot at the MiMo Studio interface for casual prompts.

Notice the benchmark choices, though, because Xiaomi was very specific. They claim leadership on SWE-bench Pro, GDPval, and ClawEval — three evaluations where token efficiency and long-horizon coherence matter more than raw single-shot capability. They didn't lead with HumanEval or MMLU. They led with the benchmarks that measure how well a model behaves inside an actual agent loop with hundreds of tool calls.

That's not a marketing accident. That's a thesis.

Before I get to the workload-by-workload breakdown, you need to understand the architectural bet hiding inside that thesis — because it explains every result that follows.

The Token-Efficiency Bet That Nobody Else Is Making

Here's what I think is actually going on. Xiaomi didn't try to win the absolute capability race. They tried to win the capability-per-token race at the frontier — and that requires a fundamentally different architectural decision than the one Anthropic, OpenAI, or Google are making.

Opus 4.7 is optimized for high-stakes single calls. So is GPT-5.5. So is Gemini 3.1 Pro. The pricing reflects that: when you're paying $15/$75 per million on Opus 4.7, you're buying the long tail — the one decision in a hundred where the smaller model would have shipped a subtle bug into production.

MiMo 2.5 Pro is optimized for coherent long-horizon work. The 1M token context isn't a flex; it's load-bearing. When you're running an agentic loop that pulls 200K tokens of repo context, plans across a 14-step refactor, makes 600 tool calls, and writes back 40K tokens of code — the question stops being "is each call as smart as Opus?" The question becomes "does the model stay coherent at call 487?"

The MarkTechPost write-up noted something that stuck with me: MiMo 2.5 Pro completes the SysY compiler benchmark in 4.3 hours across 672 tool calls, scoring a perfect 233/233 against the hidden test suite. That's the kind of task that takes a strong undergraduate computer science student a full semester. The model didn't just finish — it finished while burning roughly 70K tokens per trajectory on ClawEval, which is 40 to 60 percent fewer tokens than Opus 4.6, Gemini 3.1 Pro, or GPT-5.4 at the same capability bar.

Token efficiency isn't a number that excites anyone in a launch video. But if you're running production agent loops at scale, it's the only number that matters. A model that's 5% smarter but burns 2x the tokens is a worse model for agent work. A model that's 5% dumber but burns 0.5x the tokens is the right tool for almost every long-horizon workload.

That framing is why I had to actually run the tests carefully. The question isn't "is MiMo 2.5 Pro better than Opus 4.7?" The question is "what specific shape of work does it handle well enough — and cheaply enough — that I should stop reaching for Opus first?"

Here's what I found.

Test 1: The macOS Browser Clone — Where the Demo Holds Up

I started with the demo Xiaomi was leading the launch video with: a full macOS desktop clone running entirely in the browser. Finder. Safari. Messages. Notes. Maps. Photos. Music. Terminal with command-line animation. Calculator. Calendar. Weather widget. Settings panel. All in a single HTML/CSS/JS bundle.

I gave MiMo 2.5 Pro the same prompt I'd given Opus 4.7 last week and Qwen 3.6 Max Preview the week before: build a working macOS desktop clone, single file, vanilla web stack, with at least eight functional apps and a working dock with hover magnification.

The output was — and I want to be careful with this word — startlingly competent. The dock animation had the right magnification curve. Window chrome had the correct corner radius and shadow falloff. Calculator did floating-point math without the rounding errors I've watched smaller models make. Notes had a working autosave indicator. Terminal had a typed-character animation that actually felt right. Maps rendered a recognizable city grid with zoom controls.

It rendered on first run. Not after I fixed three console errors. First run.

But here's where MiMo 2.5 Pro's specific weakness showed up — and I want to flag it because it's the kind of thing the launch video skips. The top toolbar was almost right and not quite. The Apple menu was there but had no dropdown. The Settings panel rendered but most of the toggles were non-functional decoration. The model finished the visible 80% of the demo and skipped the polish layer that takes a real engineer twice as long as the rough cut.

For comparison, Opus 4.7 produced output that was roughly 12% more polished — better typography, working settings panels, a more refined Photos lightbox. But it took 3.4x longer to generate and cost roughly 14x more in tokens. GPT-5.5 produced something noticeably weaker — the dock looked off, two of the apps had layout bugs, and the Terminal animation jittered.

This is the workload MiMo 2.5 Pro was built to win at the price-per-capability ratio: front-end code generation with heavy creative latitude, single-shot output, no follow-up debugging needed for the core functionality. If you can live with finishing the polish layer yourself, you're paying a fifteenth of the cost.

But before you assume that pattern holds everywhere, the next test is where it cracks.

Test 2: The Minecraft Clone — Where the Reach Exceeds the Grasp

The second test was the demo I was most skeptical of from the launch video. A working Minecraft clone in the browser — procedural terrain, breakable blocks, textures, water, clouds, cave systems, ores, an inventory UI.

MiMo 2.5 Pro shipped a working build. Block-breaking worked. Block-placing worked. Textures applied. Water had a believable shimmer. Clouds drifted. Caves had ores embedded in the right rock layers. Inventory UI showed slots, hotbar, and a draggable interface.

Then I tried to walk to the edge of the world.

The world doesn't generate infinitely. There's a fixed terrain bounding box, and once you walk past it, you fall through the floor into a void. That's not a subtle bug — that's the model deciding that "Minecraft clone" meant a finite arena rather than the actual chunk-loading procedural generation that makes Minecraft Minecraft.

I gave the same prompt to Opus 4.7 for comparison. Opus produced a smaller world (a 64×64 fixed grid versus MiMo's 128×128), no caves, simpler textures — but it explicitly noted in the code comments that infinite chunk loading was out of scope for a single-prompt request. GPT-5.5 refused initially, citing complexity, then produced a tech-demo of cubes that didn't really qualify as a game.

The lesson from this test: MiMo 2.5 Pro is ambitious. It reaches for the hard parts of a problem in ways the American flagships don't. Sometimes the reach pays off. Sometimes it produces 90% of an impressive demo and quietly skips the 10% that would have made it actually correct. If you're prototyping and the visible quality matters more than you can afford to debug, the price premium for Opus on this specific workload pays off.

If you're prototyping and you're going to refactor the output anyway, MiMo 2.5 Pro gets you to a usable starting point much faster and much cheaper.

Test 3: The Three.js Stress Set — SUV Physics, Solar Systems, and the Pong Detail

This is where the model's real personality came through.

I gave it a 3D simulation prompt set I've been using since GPT-5.4 dropped: render an SUV doing an off-road durability test on procedural terrain, render a solar system with accurate orbital mechanics, render a 2000s-era TV-room with a working CRT showing fireworks, render a fractal tree, render a flock of birds with boid physics, render a working Pong game with audio visualization.

MiMo 2.5 Pro shipped six demos. Five of them were genuinely impressive. The SUV physics test had body roll, suspension travel, and tire deformation that beat Gemini 3 Flash on direct comparison. The solar system had correct orbital periods (Earth completes a revolution in 365 model-seconds, Jupiter takes 4,332). The fractal tree branched recursively with believable randomization. The bird flock used proper boid separation, alignment, and cohesion rules. The Pong game was the cleanest version of Pong I've seen a model ship — paddle physics felt right, ball acceleration ramped correctly, audio visualization actually responded to ball-paddle collisions rather than just running a generic waveform.

The TV-room demo was the one that surprised me. The CRT had the right scan-line effect. The fireworks had particle physics. The night city in the window was procedurally generated with believable building lights. There was even a small ocean visible in the distance with reflective wave shaders. The audio visualization was wired to a synth pattern that actually sounded coherent.

This is the test where MiMo 2.5 Pro genuinely embarrassed Gemini 3 Flash and held its own against Opus 4.7. For 3D scene composition with multiple coordinated systems, it's the best open-weights model I've used.

There was one demo where it lost: a 360-degree product viewer for a sneaker. MiMo 2.5 Pro shipped the rotation logic correctly, but couldn't implement working color customization — clicking the swatches changed the UI state but didn't update the 3D model's material properties. DeepSeek V4 had nailed this exact prompt last month. So if you're building a true 3D product configurator, V4 is still the tool. For everything else in this stress set, MiMo 2.5 Pro is competitive with models charging 10-15x more per token.

Test 4: The Real Workload — Multi-File Laravel Refactor

Front-end demos are fun, but they're not what I get paid for. The test I cared most about was a real client workload: a Laravel 12 codebase with 47 files, a permissions system that needed to be migrated from a custom ACL implementation to Laravel's built-in policy classes, with full backwards compatibility on the API contract.

This is the workload I run on Opus 4.7 when the budget allows and on Qwen 3.6 Plus when it doesn't. Roughly 280K tokens of context get pulled in. The agent runs for 90-180 minutes. Tool calls land somewhere between 200 and 500 depending on how clean the existing code is.

I ran the same prompt three ways: Opus 4.7 as the baseline, Qwen 3.6 Max Preview as the budget challenger, MiMo 2.5 Pro as the new variable.

Opus 4.7 took 142 minutes, made 312 tool calls, produced a clean migration that passed all 184 existing tests on first run, and cost roughly $11.40 in tokens. The output was the kind of work I'd ship to a client without a second pass.

Qwen 3.6 Max Preview took 168 minutes, made 387 tool calls, passed 178/184 tests on first run, and cost roughly $1.20 in tokens. The six failures were all in edge-case permission inheritance — fixable in maybe 25 minutes of human cleanup.

MiMo 2.5 Pro took 156 minutes, made 287 tool calls, passed 181/184 tests on first run, and cost roughly $0.95 in tokens. The three failures were all in one specific area — a circular dependency in the policy registration that I'd actually flagged as a known landmine in the prompt. MiMo handled the rest of the migration cleaner than Qwen did, used fewer tool calls than Opus did, and produced code that read closer to the existing codebase's style than either competitor.

That's the result that changed how I'm thinking about my agent stack. For a workload that runs me $11 on Opus, MiMo 2.5 Pro got me to 98% of the same outcome for under a dollar. The 2% gap is real — and on client work where I bill the model cost directly, that 2% is worth paying for. But for my own internal work, for prototyping, for the dozens of small refactors I run in a typical week? The economics changed the moment that test finished.

If you'd rather have someone build out a production-grade agent stack that actually picks the right model per workload, that's exactly the kind of engagement I take on through my Fiverr listing.

What MiMo 2.5 Pro Gets Wrong — The Honest Failure List

Five days of testing. I'm not going to pretend the model is uniformly impressive. Here's the honest failure list, in the order it cost me the most time:

1. The polish-layer skip. This is the most consistent failure mode I saw. The model finishes the visible 80% of a creative front-end task and quietly skips the polish layer — non-functional toggles, incomplete dropdowns, missing animation easing on secondary interactions. It doesn't fail — it ships something that demos well and falls apart on second-pass review. If you're using MiMo 2.5 Pro for client-facing prototypes, plan to do the last 20% yourself.

2. The infinite-scope skip. Like the Minecraft world boundary, MiMo 2.5 Pro will sometimes interpret an open-ended generation request as a finite version of itself. Procedural terrain becomes a fixed grid. Infinite scrolling becomes a paginated list. The model isn't lying about what it built — it's just not asking the clarifying question Opus 4.7 would have asked. Add explicit "infinite/unbounded/procedural" language to your prompts when you mean it.

3. The pelican on a bicycle. I ran the standard SVG vibes test — pelican riding a bike, gradient paintings, butterfly wing flap animation. Two of the three nailed it. The pelican's leg-pedaling animation was off — the joints rotated but the foot-to-pedal contact wasn't synchronized, so it looked like the bird was levitating with its legs flailing rather than actually pedaling. Kimi K 2.6 had been better on the gradient paintings prompt last month. Small thing, but it's a tell that MiMo's animation timing logic isn't quite where the frontier sits.

4. The 3D product configurator gap. As noted above — the model can render impressive 3D scenes but struggles with interactive material property updates on user input. DeepSeek V4 still leads on that specific workload.

5. The reasoning-vs-output ratio. On harder reasoning tasks (the kind where Opus 4.7 noticeably "thinks longer" and produces a more careful answer), MiMo 2.5 Pro tends to commit to its first chain of reasoning rather than backtracking. It's faster and cheaper. It's also less right when the problem actually requires backtracking. For straightforward agentic loops this doesn't show up. For genuinely hard reasoning tasks — debugging a subtle race condition, untangling a complex algorithmic correctness proof — Opus 4.7 still wins, and the price gap stops mattering.

None of these are deal-breakers. All of them are worth knowing before you wire the model into a production stack and discover them at 2 AM.

Where This Fits — The Open-Source AI Landscape After MiMo

The open-source frontier in early 2026 had a clear hierarchy. DeepSeek V4 was the strongest agentic coder. Kimi K 2.6 was the strongest at long-form creative output. GLM 5 Pony was the strongest at multimodal reasoning. Qwen 3.6 Max Preview was the strongest at single-shot front-end generation. MiniMax M2.7 was the strongest at sustained multi-agent coordination.

MiMo 2.5 Pro just collapsed three of those niches into one model. It matches DeepSeek V4 on agentic coding while burning 40% fewer tokens. It matches Kimi K 2.6 on creative output for code-heavy tasks. It matches GLM 5 on multimodal reasoning for typical workloads. It doesn't beat each specialist at their specialty — but it doesn't need to. What it does is give you a single model that handles the long tail of agentic workloads without forcing you to switch models per task.

That's the genuinely interesting position MiMo 2.5 Pro occupies. It's not the smartest open-weights model (DeepSeek V4 still edges it on the hardest reasoning tasks). It's not the cheapest (Qwen 3.6 Plus is free and good enough for casual work). It's the model with the best capability-coverage-per-dollar ratio I've found in the open-weights tier.

For my agent stack going forward: Opus 4.7 stays as the model I reach for when the cost of a wrong answer is high. MiMo 2.5 Pro becomes the default for everything else. Qwen 3.6 Plus stays as the free tier I prototype with. DeepSeek V4 stays for the specific hard-reasoning workloads where its edge shows up.

That's a meaningful change. A month ago, that default-tier slot was Qwen 3.6 Max Preview. Two weeks before that, it was Opus 4.7 itself.

How to Actually Try MiMo 2.5 Pro This Weekend

If you want to put hands on the model in the next hour, three paths work:

1. Free chatbot access. Head to the MiMo Studio interface and prompt directly. No API key. No payment. The fastest way to see whether the model fits your workload.

2. OpenRouter API. Available at xiaomi/mimo-v2.5-pro for $1 input / $3 output per million tokens. Drop-in compatible with most agent frameworks. This is how I ran every test in this post.

3. Kilo Code with $25 free credits. If you're building agentic coding workflows specifically, Kilo Code has officially integrated MiMo 2.5 Pro and is offering $25 in free credits to test it. Roughly 6.25M output tokens of testing budget.

4. Local multi-GPU inference. Weights are live on Hugging Face under XiaomiMiMo/MiMo-V2.5-Pro. You'll need significant GPU infrastructure to run a 1.02T MoE locally, but it's doable for teams with the hardware budget. Under MIT license, commercial use included.

For most readers of this post, OpenRouter or Kilo Code is going to be the right entry point. Spend $5 in tokens running the model against three or four of your real workloads. You'll know within the first hour whether it earns a slot in your stack.

Is Xiaomi MiMo 2.5 Pro better than Claude Opus 4.7?

Not on raw capability — Opus 4.7 still wins on the hardest reasoning tasks and produces more polished output on creative front-end work. But MiMo 2.5 Pro delivers roughly 90-95% of Opus's agentic coding output at a fifteenth of the input cost and a twenty-fifth of the output cost. For most production agent workloads, the price-per-capability ratio favors MiMo by a wide margin.

Can I use MiMo 2.5 Pro commercially?

Yes. The model is released under the MIT License with commercial use explicitly permitted. You can use it through hosted providers like OpenRouter or Kilo Code, or you can download the weights from Hugging Face and self-host on multi-GPU infrastructure. No usage restrictions, no royalties.

What's the actual context window and output limit?

MiMo 2.5 Pro supports 1,048,576 input tokens (1M context window) and 131,072 max output tokens per call. Those are both verified on the OpenRouter listing and the official Xiaomi documentation. The 1M context is genuinely usable for long-horizon agent loops, not a benchmark-only number.

How does MiMo 2.5 Pro compare to DeepSeek V4 on coding?

On standard agentic coding workloads, they're effectively tied — MiMo is slightly more token-efficient, DeepSeek V4 is slightly stronger on the hardest reasoning-heavy tasks. The bigger differentiator is interactive 3D output, where DeepSeek V4 still leads on product configurators and complex material property updates. For everything else, pick based on which provider's pricing and latency works better for your stack.

What's the catch with the $1/$3 pricing?

There isn't a catch. The pricing reflects Xiaomi's strategic decision to compete on token efficiency rather than per-call capability — and the open-source release means hosted providers like OpenRouter compete to offer the model at thin margins. Expect the price floor to drop further as more providers come online, not rise.

The One Question Worth Sitting With

I started this post almost not opening the tab. I'm ending it with MiMo 2.5 Pro in my default agent slot for the next month and Opus 4.7 reserved for the workloads where the cost of a wrong answer outweighs the cost of the tokens.

That's a bigger shift than it sounds. For most of 2025 and the first quarter of 2026, "open-source AI" meant "the cheap option you fall back to when you can't afford the real model." MiMo 2.5 Pro is the first release where that framing stopped being true. The real model is now competing with an open-weights model that costs a fifteenth as much, ships under MIT, and doesn't need to be hosted on infrastructure you don't control.

If you're running production agent workloads in the second half of 2026 and you haven't tested MiMo 2.5 Pro this week, you're probably overpaying by an order of magnitude on workloads where the marginal capability of a closed flagship isn't actually buying you anything.

So here's the question worth sitting with tonight: what's currently running on Opus 4.7 in your stack — and what would change if a fifteenth of the cost got you 95% of the same outcome?

Run the test this weekend. The answer will surprise you.

Let's Work Together

Looking to build AI systems, automate workflows, or scale your tech infrastructure? I'd love to help.

Fiverr (custom builds & integrations): fiverr.com/s/EgxYmWD
Portfolio: mejba.me
Ramlit Limited (enterprise solutions): ramlit.com
ColorPark (design & branding): colorpark.io
xCyberSecurity (security services): xcybersecurity.io

Xiaomi MiMo 2.5 Pro Tested: Open Source Frontier?