MiniMax M3: The Open-Weight Model That Stunned Me
A model spent 24 hours rewriting a GPU kernel by itself, made 147 benchmark submissions, fired off nearly 2,000 tool calls, and pushed hardware utilization from 7.6% to 71.3% without a single human keystroke. That's roughly a 9.4x speedup on an FP8 CUDA kernel for NVIDIA Hopper GPUs — starting from a Triton skeleton that didn't even run.
The model that did it is MiniMax M3. And it's open-weight.
I want to sit with that second sentence for a moment, because it's the part that made me stop scrolling. We've gotten used to a particular arrangement in AI: the frontier-tier capabilities live behind closed APIs from a handful of US labs, and the open models trail a generation or two behind, useful but never genuinely competitive at the top. MiniMax M3, which the MiniMax team launched on June 1, 2026, is the first open-weight release I've looked at this year that seems built specifically to break that arrangement — frontier coding, a one-million-token context window, and native multimodality, all in a single model you'll be able to download and self-host.
Now, the launch claims are loud. Beating GPT-5.5 and Gemini 3.1 Pro on key benchmarks. Approaching Opus 4.7 on coding. A promotional price that's a rounding error next to proprietary frontier rates. Some of those numbers are independently traceable, some of them are MiniMax's own benchmarks, and a few deserve the kind of skepticism I bring to every model launch. So this is a first look — what M3 actually claims, what I can verify, where I'd push back, and whether it deserves a slot in your stack. Let me walk you through it.
Why an Open-Weight Frontier Model Matters Right Now
The timing is the whole story here, so let me anchor it.
For the last several months, every interesting model drop has followed one of two scripts. Either it's a closed US frontier model — Opus 4.7, GPT-5.5, Gemini 3.1 Pro — where you rent intelligence by the token and never touch the weights. Or it's a capable open release from a Chinese lab that's good but clearly positioned a tier below the frontier. I've reviewed plenty of the second kind: my hands-on with DeepSeek V4-Pro and my breakdown of Kimi K2.6 both landed in that "genuinely useful, not quite frontier" bucket.
MiniMax M3 is positioned to be the first one that doesn't accept that ceiling. According to MiniMax, it's the first and only open-weight model to combine three things at once: frontier-level coding, a 1M-token context window, and native multimodality. Each of those exists separately in other open models. Bringing all three together in one downloadable checkpoint is the actual headline — not any single benchmark number.
Here's why you should care even if you're perfectly happy paying for Opus. Open weights change the economics and the control surface. You can run M3 on your own infrastructure, fine-tune it on your domain, audit it, and never send a token to someone else's servers. For anyone building agents that process sensitive data — legal, medical, financial — that's not a nice-to-have, it's the difference between "we can use this" and "compliance said no." I've watched that exact conversation kill promising internal AI projects more than once.
And the price. The launch promotion halves usage fees to $0.30 per million input tokens and $1.20 per million output tokens (down from the standard $0.60 / $2.40), with a $20/month token plan that buys roughly 1.7 billion M3 tokens. VentureBeat framed M3 as delivering competitive benchmark performance "for just 5-10% of the cost" of the proprietary leaders. If that holds up under real workloads, the build-vs-buy math for a lot of teams flips overnight.
But cheap and open mean nothing if the model can't actually do the work. So before I get excited, I need to understand the architecture that's supposed to make all this possible — because that's where the launch gets genuinely interesting.
What Is MiniMax Sparse Attention (MSA)?
MiniMax Sparse Attention (MSA) is the architectural mechanism that lets M3 process a one-million-token context affordably by replacing full attention with selective KV-block attention — computing attention only over the blocks that matter instead of every token against every other token.
That's the one-sentence version. Here's why it's the load-bearing wall of this entire release.
A quick aside on the name, because the auto-generated transcripts floating around have mangled it badly. I've seen it written as "Multi-Scale Attention" and "Miniax sparse attention." The correct term, per MiniMax's own technical material, is MiniMax Sparse Attention (MSA). Same family of idea as the sparse and lightning-attention work MiniMax has shipped in prior generations, refined for M3.
Standard transformer attention has a brutal property: cost scales roughly with the square of the sequence length. Double your context, quadruple your compute. That's why million-token context windows have historically been either eye-wateringly expensive or quietly degraded — the model technically accepts a long input but stops paying real attention to most of it. You've probably felt this. You paste a huge document, ask a question about page 40, and the model confidently answers based on page 2.
MSA attacks that directly. Instead of every token attending to every other token, it selects the relevant KV blocks and computes attention over those. The reported payoff is dramatic: MiniMax says MSA delivers roughly 15.6x faster decoding and 9.7x faster prefill versus the previous M2 generation at million-token contexts, and brings the cost at 1M tokens down to something like one-twentieth of the prior generation. The decoder coverage published with the launch describes the long-context tier kicking in above 512,000 tokens.
I want to be honest about my confidence level here. The direction of these claims is credible — selective attention is a well-established way to break the quadratic curve, and multiple labs are converging on variants of it. The exact multipliers are MiniMax's own measurements, and I haven't independently profiled them. So treat "9.7x prefill" as a vendor benchmark, not a law of physics. What I can say is that the architecture is the right shape for the problem, and the engineering story is internally consistent.
There's a second architectural decision that matters just as much, and it's the one I think people will underrate.
Native multimodality is not the same as bolted-on vision
M3 was trained on text and visual data from step zero — natively multimodal — rather than taking a strong text model and grafting a vision encoder onto it afterward.
This distinction sounds academic until you see the difference in practice. Bolt-on vision models tend to treat images as a separate sense that gets translated into text-ish tokens and reasoned about at arm's length. Natively multimodal models build a shared representation where visual and textual understanding are entangled from the start. The launch demo that drove this home for me was a form-filling task: M3 was given a blank form image and a set of data points, and it placed every value in the correct field with correct spacing and character positioning — reasoning step by step through coordinates, field placement, and layout.
That's not "read the text in the image." That's spatial reasoning over a visual layout. And MiniMax reports M3 hitting 70.06% on OSWorld-Verified, a computer-use benchmark — the kind of result you only get when visual and action reasoning are tightly coupled.
So the architecture promises frontier reasoning at long context, cheaply, with vision baked in. Bold. Now let's see whether the benchmarks back the architecture — and this is where I get more cautious.
The Benchmarks: What's Real, What's Vendor-Reported, and Where I'd Push Back
Let me put the headline numbers on the table first, then we'll interrogate them. Every figure below is from MiniMax's launch or widely-reported coverage of it — I'll flag what's been corroborated by third parties versus what's purely first-party.
| Benchmark | MiniMax M3 (claimed) | What it measures | Context |
|---|---|---|---|
| SWE-bench Pro | 59.0% | Autonomous software-engineering tasks | Reported ahead of GPT-5.5 (~58.6%); behind Opus on coding |
| Terminal-Bench 2.1 | 66.0% | Terminal/agent task completion | Strong agentic result |
| SWE-fficiency | 34.8% | Efficiency of code changes | Mid-tier, honestly |
| KernelBench Hard | 28.8% | Low-level GPU kernel generation | The hard one — note the absolute number |
| MCP Atlas | 74.2% | Tool-use via Model Context Protocol | Strong tool orchestration |
| BrowseComp | 83.5 | Web browsing / research agent | Top-tier browsing |
| OSWorld-Verified | 70.06% | Computer use (vision + action) | Backs the native-multimodal claim |
| SVG-Bench | Surpasses Opus 4.7 | SVG generation quality | First-party comparison |
Now the honest read.
The single most-cited result is SWE-bench Pro at 59.0%, which puts M3 narrowly ahead of GPT-5.5's roughly 58.6% and ahead of Gemini 3.1 Pro on that specific benchmark. That's the number doing the heavy PR lifting, and it's the one most worth your skepticism — not because it's fabricated, but because a single-benchmark lead of half a percentage point is well within the noise of how these evals are run, scaffolded, and reported. An open-weight model landing in the same cluster as GPT-5.5 on a real agentic coding benchmark is the genuinely impressive fact. "Beats GPT-5.5" as a headline oversells a statistical tie.
Where the framing matters most: MiniMax does not claim to beat Opus on coding. The reporting I've seen has Opus 4.8 leading coding at around 69.2% on SWE-bench Pro versus M3's 59.0%. So the accurate statement is "M3 approaches Opus-tier and trades blows with GPT-5.5 and Gemini 3.1 Pro" — not "M3 is the new king." I've compared the proprietary frontier in detail in my Opus 4.7 vs GPT-5.4 vs Gemini 3 Pro breakdown, and the gaps at the very top are small but real.
One number I want you to read correctly: KernelBench Hard at 28.8%. Out of context that looks low, and people will dunk on it. But KernelBench Hard is brutal — it asks a model to write performant, correct GPU kernels, a task most models score in the single digits or low teens on. 28.8% on the hard split is actually a strong showing for an open model, and it's directly relevant to that 24-hour CUDA kernel story I opened with. Absolute numbers without the benchmark's difficulty baseline are how launch posts mislead you.
The benchmarks where M3's open-weight status makes the result genuinely surprising are the breadth ones — BrowseComp, SVG-Bench, KernelBench Hard, MCP Atlas, and the document-understanding evals — where an open model is reportedly matching or beating proprietary rivals across categories, not just on one cherry-picked metric. Breadth is harder to game than a single number. That's the part of this launch I take most seriously.
If you want help separating signal from noise on releases like this, that's exactly the kind of evaluation I do for clients — testing models against real workloads instead of trusting the launch slides. You can see the kind of builds I take on at fiverr.com/s/EgxYmWD.
That's the leaderboard view. But benchmarks are abstractions. The reason I'm paying attention to M3 is the two long-horizon autonomy demos — because those are much harder to fake than a leaderboard row.
The 24-Hour Kernel and the Self-Training Test: Long-Horizon Autonomy
Here's the demo that made me write this post, told properly.
MiniMax handed M3 a task most senior engineers would dread: optimize an FP8 GEMM (matrix-multiply) kernel on NVIDIA Hopper GPUs. The catch — M3 got only a task description, a benchmark evaluation script, and a non-functional Triton skeleton. No reference implementation. No starter code that ran. It had to make the thing work and make it fast, from almost nothing.
Then they let it run.
Over roughly 24 hours, M3 made 147 benchmark submissions and 1,959 tool calls, working through baseline implementation, autotuning, bottleneck diagnosis, CUDA Graph integration, persistent-kernel rewriting, and host-side scheduling. Hardware peak utilization climbed from 7.6% to 71.3% — a 9.4x speedup. The detail I find most telling: MiniMax reports that most other models stopped making new progress within their first 30 submissions; only Opus 4.7 and M3 kept finding improvements past that point.
That last point is the real signal. Lots of models can take one good swing at a problem. Very few can sustain a campaign — diagnosing why attempt 89 plateaued and what to try for attempt 90 — without spiraling into repetition or hallucinating progress that isn't there. Long-horizon coherence is the capability that separates a chatbot from an agent, and it's the thing I test hardest in my own work. I dug into why sustained autonomy is so hard in my MiniMax M2.7 review, where the previous generation's self-evolution was the headline.
The second autonomy test is, if anything, more audacious. On a "Post-Train Bench" that measures whether a model can run the full machine-learning loop itself — synthesize training data, train a model, evaluate it, iterate — M3 was given four base models that had only completed pretraining and ran the whole data-synthesis-to-iteration cycle over about 12 hours with no human intervention. It reportedly ranked third, behind only Opus 4.7 and GPT-5.5, ahead of every other model tested.
A model that can autonomously improve other models, ranking among the top three in the world at it, while being open-weight, is the kind of sentence that would've read as science fiction eighteen months ago.
My honest caveat, same as always: these are MiniMax's demonstrations, run by MiniMax, reported by MiniMax. They're not peer-reviewed and they're not adversarial. The numbers could be best-case runs cherry-picked from many attempts. But — and this matters — the structure of these tests is hard to fake convincingly, because the artifacts (a working, fast CUDA kernel; trained model checkpoints) are verifiable end products, not just scores. I'd want to reproduce them before betting a production system on M3's autonomy. The direction, though, is unmistakable.
Benchmarks and autonomy demos are one thing. What I actually care about as a builder is whether the thing writes good code I'd ship. So let's look at the build tests.
How Does MiniMax M3 Handle Real Front-End and Creative Coding?
MiniMax M3 produces production-leaning front-end output — clean component structure, multiple typography systems, and working animations — and in the launch comparisons it outperformed Qwen's latest "Max" model and produced fewer bugs than Gemini Flash on the same prompts.
A naming correction first, since the transcripts garble it: the comparison model is Qwen (Alibaba's flagship, the proprietary "Max" tier as of its mid-2026 release), not "Quen 3.7." And the lighter Google model is the Gemini Flash line — I tested that family separately in my Gemini 3.5 Flash hands-on. Getting the comparison set right matters, because "beats Qwen Max" and "beats a small Flash model" are very different claims.
Here's what the build demos actually showed, and how I read each one.
The landing page test. Given a prompt for a landing page with color blocking and a variable color system, M3 produced a clean, well-structured design with dynamic interactions — and in the head-to-head, Gemini's output was buggier. This tracks with my general experience: the gap between models on UI work isn't usually "can it center a div," it's "does the spacing system stay consistent across components and does the interactivity actually work." M3 reportedly held both. That's the production-readiness threshold.
The browser-based Windows 11 clone. This is the one that made me raise an eyebrow. From a single prompt, M3's build included startup sounds and animations, a functional login with PIN entry, working replicas of Notepad and Paint, a Calculator, Command Prompt, a Settings app with volume control — and, unprompted, a 3D trench-run game. The unprompted game is the interesting tell: it suggests the model wasn't just pattern-matching the literal request but elaborating on the spirit of "build a desktop OS." The one miss reported was SVG-coding every app icon. I'll take that trade.
The 3D and physics test. Asked to simulate nine channels on a 1990s concave TV screen, M3 returned precise 3D rendering using 3D Gaussian Splatting (3DGS) — that's the "3GS" the transcript mangled — with UI controls, animations, physics simulation, procedural graphics, and embedded sound. An immersive 3D room, from a text prompt. If you've ever wrestled a model into producing coherent Three.js or WebGL, you know how rare clean physics-aware 3D output is.
The SVG-at-scale test. Three SVG challenges: an animated butterfly (high quality, comparable to Gemini), a PS4 controller (accurate layout and keypad, beating Qwen), and an NYC skyline with day/night transition that ran 2,000+ lines of SVG with animated scene transitions and no filler padding. That last one is the real test. Generating 2,000 lines of meaningful markup without the model giving up, looping, or stuffing the output with repetitive junk is a genuine long-output stress test — and it ties straight back to that MSA long-context architecture.
The through-line across all four: M3 isn't just producing code that compiles, it's producing code with taste — layout discipline, unprompted elaboration, sustained coherence over long outputs. That's the qualitative jump that benchmarks struggle to capture.
So how do you actually get your hands on it? That part's refreshingly simple.
How to Access MiniMax M3 (API, CLI, and OpenRouter)
You can use MiniMax M3 through three main routes today: the MiniMax API directly, the MiniMax coding platform/CLI, and OpenRouter — and the weights are slated to release publicly within about ten days of launch for self-hosting.
Here's the practical breakdown, with the naming cleaned up (the transcript's "M Code," "Open Code," and "Open Router" map to the MiniMax coding platform/CLI and OpenRouter respectively):
-
MiniMax API — Grab an API key from the MiniMax platform and call M3 directly. Pricing during the launch promotion is $0.30/M input and $1.20/M output tokens (half the standard $0.60 / $2.40). Rate limits at launch were reported around 200 RPM and 10M TPM. This is your route for production integrations.
-
MiniMax coding platform / CLI — MiniMax ships its own coding tool, and the launch noted a code platform offering M3 access free. Because the API is OpenAI-compatible, you can also drop your key into tools like Claude Code, Cline, or OpenCode and point them at M3 — the same pattern people have used with prior MiniMax models. If you want the full setup walkthrough for routing third-party coding tools at MiniMax, I covered the workflow in my MiniMax M2.7 review.
-
OpenRouter — M3 is listed on OpenRouter (
minimax/minimax-m3), which is the fastest way to test it against models you already use without managing a second API key. This is where I'd start if you just want to kick the tires for an afternoon. -
Self-hosting (soon) — Once the weights land on Hugging Face and GitHub, you can run M3 on your own infrastructure. This is the option that unlocks the compliance and fine-tuning use cases I mentioned earlier — and the reason the "open-weight" label is more than a marketing word.
A specific cost note worth internalizing: the 1M-token context comes with a tier. MiniMax guarantees a usable minimum of 512,000 tokens at the standard rate; requests above 512K bill at the long-context tier, reported at roughly double the standard per-token rate. So "1M context" is real, but the back half of that window costs more. Budget accordingly — don't architect an agent that casually blows past 512K tokens on every call unless you've done the math.
Pro tip: if you're evaluating M3 for an agent that needs the full million-token window, instrument your token usage before you commit. I've seen long-context agents quietly 4x their cost because nobody noticed the context was ballooning past the cheap tier on every loop. Measure first.
Now — should you actually adopt this? Here's where I separate the hype from what I'd genuinely tell a client.
The Real Talk: Where I'd Trust M3 and Where I Wouldn't
I'll give you the honest version, the one I'd give a friend over coffee rather than the launch-day enthusiasm.
What genuinely impresses me. The combination is the achievement, not any single number. Frontier-adjacent coding plus 1M native-multimodal context plus open weights plus a price in the single-digit-percent range of proprietary frontier models — that bundle didn't exist before June 1, 2026. For a solo founder or a small team that's been priced out of running serious agents on Opus, M3 changes what's affordable. The breadth across benchmarks (not just the headline SWE-bench number) and the long-horizon autonomy demos are the parts I weight most heavily, because they're the hardest to fake.
Where I'd pump the brakes. Every performance number above is MiniMax's own, run under MiniMax's conditions. The "beats GPT-5.5" framing rests on a half-point lead that's statistically a tie. M3 does not beat Opus on coding, and anyone telling you it's "the new frontier king" is selling something. Vendor benchmarks have a long history of not surviving contact with independent reproduction — I've watched plenty of launch-day leaderboard toppers settle into "very good, not best" once the community ran them adversarially. Until third parties profile M3 on their own harnesses, I'm treating these as promising, not proven.
The trade-off nobody on launch day mentions. Open weights are a gift and a responsibility. Self-hosting a 1M-context multimodal model is not a weekend project — you need real GPU infrastructure, and the long-context tier is genuinely expensive on the back half of the window. The "free" and "cheap" framing applies cleanly to the API tier and small contexts. Push into million-token agent loops and the costs are real. Don't let "open-weight and cheap" lull you into architecting something your budget can't sustain at scale.
My prediction. I think M3 is the start of a pattern, not a one-off. The gap between open and closed frontier models has been shrinking for a year, and M3 is the first release where I'd say the gap at the top is now a matter of months, not generations — at least on coding and agentic tasks. By the end of 2026 I expect "use an open model for 90% of agent work, fall back to a closed frontier model for the hardest 10%" to be a completely mainstream architecture. M3 makes that architecture viable today.
Here's the uncomfortable question that hangs over the whole proprietary-frontier business model: if an open-weight model gets you 90% of the way at 5-10% of the cost, what exactly are you paying the other 90% for? For some workloads the answer is "the last 10% of reliability, and it's worth it." For a lot of workloads, it suddenly isn't.
So what does adopting M3 actually look like in practice, and how would you know it's working? Let me ground it.
What to Expect If You Actually Adopt M3
Realistic expectations, based on the mechanism rather than invented metrics.
If you're currently running agent workloads on a proprietary frontier model and you swap the bulk of them to M3, the cost mechanism is straightforward: at $0.30/$1.20 per million tokens (promo) versus proprietary frontier rates that run many multiples higher, your per-task spend on routine agent work drops substantially — VentureBeat's "5-10% of the cost" framing is the order of magnitude to plan around for comparable benchmark performance. The honest caveat is that the savings shrink once you push past the 512K long-context tier, so the biggest wins are on short-to-medium context tasks at high volume.
What to actually measure once you're testing:
- Task completion rate on your real workloads, not benchmarks. Run M3 and your current model on the same 20 real tasks and compare. This is the only number that matters.
- Long-horizon stability. For multi-step agents, watch how many steps M3 sustains before it loses the thread or starts repeating. The kernel demo suggests this is a strength — verify it on your tasks.
- Hallucination rate on your domain. Native multimodality and long context don't automatically fix fabrication. Spot-check outputs against ground truth.
- Cost per completed task (not per token). A cheaper model that needs three retries isn't cheaper.
Quick wins you can expect in the first afternoon: front-end and SVG generation that lands closer to production-ready than most open models, and dramatically lower cost on high-volume, short-context agent loops. The longer-term payoff — self-hosting for compliance, fine-tuning on your domain — arrives once the weights drop and you've stood up the infrastructure.
Don't expect: a free lunch on 1M-context workloads, or independently-verified frontier supremacy. Expect a genuinely strong, genuinely open model that's good enough to be the default for most of your agent work, with a closed frontier model as your fallback for the hardest tasks.
The Bottom Line on MiniMax M3
Go back to that opening image: a model alone in the dark for 24 hours, submission after submission, pushing a dead kernel skeleton from 7.6% to 71.3% utilization with nobody watching. The thing that makes that story matter isn't the speedup. It's that the model which did it is one you'll soon be able to download, inspect, fine-tune, and run on your own machines — for a price that makes the proprietary frontier's economics look suddenly fragile.
MiniMax M3 isn't the best model in the world. Opus still leads on coding, the headline benchmark wins are statistical ties, and every number here deserves the skepticism I'd bring to any launch. But "best in the world" was never the point. The point is that frontier-adjacent capability, native multimodality, a million-token context, and open weights now arrive in a single release at a fraction of the cost — and that combination didn't exist a week ago.
If you build agents, here's your next 24 hours: pull up M3 on OpenRouter, take the exact same five tasks you ran on your current model last week, and run them side by side. Don't trust my read, don't trust MiniMax's slides. Run your own gauntlet. Then come tell me whether the open frontier just arrived — because from where I'm sitting, it looks like it did.
FAQ
Frequently Asked Questions
Everything you need to know about this topic
MiniMax M3 is open-weight, with the weights slated to release publicly on Hugging Face and GitHub within about ten days of the June 1, 2026 launch. The API isn't free — it runs $0.30/M input and $1.20/M output during the launch promotion — but a code platform offers M3 access free, and self-hosting becomes possible once weights drop.
MiniMax M3 reportedly edges GPT-5.5 on SWE-bench Pro (59.0% vs ~58.6%), but that's a statistical tie, not a clear win. It does not beat Opus on coding — Opus leads at roughly 69.2% on the same benchmark. The accurate framing is "M3 approaches the proprietary frontier," not "M3 is the new king."
MiniMax M3 supports up to a 1,048,576-token (1 million) context window, with a guaranteed usable minimum of 512,000 tokens at the standard rate. Requests above 512K tokens bill at a long-context tier roughly double the standard per-token price, so the back half of the window costs more. See the access section above for cost planning.
You can access MiniMax M3 through the MiniMax API, the MiniMax coding platform/CLI, and OpenRouter (minimax/minimax-m3), with self-hosting available once the open weights release. OpenRouter is the fastest way to test it against models you already use. For the full setup walkthrough, see the access section above.
In MiniMax's launch comparisons, M3 produced cleaner, more production-leaning front-end output than Qwen's latest Max model and fewer bugs than Gemini Flash on the same prompts — strong layout discipline, working animations, and coherent component structure. Verify it on your own UI tasks before adopting it as a default.
Let's Work Together
Looking to build AI systems, automate workflows, or scale your tech infrastructure? I'd love to help.
- Fiverr (custom builds & integrations): fiverr.com/s/EgxYmWD
- Portfolio: mejba.me
- Ramlit Limited (enterprise solutions): ramlit.com
- ColorPark (design & branding): colorpark.io
- xCyberSecurity (security services): xcybersecurity.io