Stephen Van Tran

The model that trained itself and then walked out the door

On April 12, 2026, a publicly traded Chinese AI company called MiniMax released the weights of its most capable model, M2.7, under an open license. That alone would be routine — open-source model drops happen weekly now. What is not routine is what happened during the model’s development. M2.7 was given access to its own scaffold code and told to improve it. It ran more than 100 autonomous optimization rounds, executing a loop that no human directed: analyze failure trajectories, plan changes, modify its own code, run evaluations, compare results, decide to keep or revert. The process yielded a 30 percent performance improvement on internal benchmarks. MiniMax M2.7 did not just learn from data. It learned how to learn better, then applied that knowledge to itself — and then MiniMax gave the result away for free.

The benchmarks that emerged from this self-improvement loop are striking. On SWE-Pro, a multilingual software engineering benchmark, M2.7 scored 56.22 percent — matching OpenAI’s GPT-5.3 Codex. On Terminal Bench 2, which measures complex terminal-based problem solving, it posted 57.0 percent. On VIBE-Pro, a repo-level code generation benchmark, M2.7 hit 55.6 percent, nearly matching Anthropic’s Claude Opus 4.6. In three 24-hour machine learning competition trials, the model earned 9 gold medals, 5 silver, and 1 bronze. These are not the scores of a novelty experiment. They are frontier-class results from a model that charges $0.30 per million input tokens — roughly 17 times cheaper than Claude Opus on input and 21 times cheaper on output.

The architecture behind these results is a 230-billion-parameter sparse mixture-of-experts transformer with 256 local experts and 8 activated per token, meaning only 10 billion parameters fire for any given input — an activation rate of 4.3 percent. The design keeps inference costs low while preserving the model’s full representational capacity. With a 200,000-token context window, 62 layers, and multi-head causal self-attention enhanced with rotary position embeddings, M2.7 represents a generation of Chinese AI models that have closed the capability gap with Western frontier labs while operating at a fraction of the cost. MiniMax IPO’d on the Hong Kong Stock Exchange in January 2026, doubling on its first day of trading and raising $619 million. The company that built this model is not a research lab burning venture capital. It is a publicly traded corporation with shareholders, revenue targets, and a strategic interest in making its most powerful model available to everyone.
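The sparse-MoE arithmetic is easy to verify. The sketch below is illustrative only — a generic top-k gating routine, not MiniMax's actual router — but it shows both the 4.3 percent activation rate and the routing step that selects 8 of 256 experts per token.

```python
import math
import random

TOTAL_PARAMS_B = 230   # total parameters, in billions
ACTIVE_PARAMS_B = 10   # parameters fired per token, in billions
NUM_EXPERTS = 256
TOP_K = 8              # experts activated per token

# Only a small slice of the network runs for any given input.
activation_rate = ACTIVE_PARAMS_B / TOTAL_PARAMS_B  # ≈ 0.043, i.e. 4.3%

def route(router_logits, k=TOP_K):
    """Top-k gating: keep the k highest-scoring experts and
    softmax-normalize their weights so they sum to 1."""
    ranked = sorted(range(len(router_logits)), key=lambda i: router_logits[i])
    chosen = ranked[-k:]
    exps = [math.exp(router_logits[i]) for i in chosen]
    total = sum(exps)
    return chosen, [e / total for e in exps]

random.seed(0)
experts, weights = route([random.gauss(0, 1) for _ in range(NUM_EXPERTS)])
```

The economics follow directly from the gating: each token pays the compute cost of 8 experts, while the full 256-expert parameter bank remains available for the router to draw on.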

MiniMax itself is a company worth understanding. Founded in December 2021 by former SenseTime computer vision researchers, it raised $850 million in private funding from investors including Alibaba, Tencent, and game developer MiHoYo before listing on the Hong Kong Stock Exchange on January 9, 2026. The IPO raised $619 million, and shares surged over 100 percent on the first day. The company’s product lineup spans text, audio, image, video, and music generation — with its Hailuo AI video platform competing directly against OpenAI’s now-shuttered Sora. MiniMax is one of China’s “six little tigers” of AI, and its decision to open-source its most capable model is both a competitive maneuver and a philosophical statement about how the industry should develop.

The release arrives at a moment when the AI industry is debating two interconnected questions that M2.7 puts in sharp relief. First: how long can closed-source pricing premiums survive when open-weight models match their performance at a twentieth of the cost? Second — and far more consequential: what happens when AI models start improving themselves, and the results are good enough to release into the wild?

Inside the self-improvement loop that rewrote the playbook

The technical details of M2.7’s self-evolution process deserve scrutiny because they represent something qualitatively different from standard training. In conventional model development, human researchers design the training pipeline, select the data, tune the hyperparameters, and evaluate the results. The model is a passive artifact shaped by human decisions at every stage. M2.7 broke that pattern. During development, MiniMax allowed the model to update its own memory, build complex skills for reinforcement learning experiments, and improve its learning process based on experiment results — initiating what the company calls “a cycle of model self-evolution.”

The specific mechanism was scaffold optimization: M2.7 was given access to the code that governs how it approaches tasks — the scaffolding that structures its reasoning, tool use, and evaluation pipelines — and told to make it better. Over 100 iterations, the model analyzed its own failures, hypothesized improvements, implemented code changes, tested the results, and decided autonomously whether to keep or revert each modification. The 30 percent performance improvement that resulted was not the product of more training data, larger compute budgets, or human engineering insight. It was the product of the model’s own judgment about how to improve itself. MiniMax reports that M2.7 can now perform 30 to 50 percent of a reinforcement learning research workflow autonomously — not just executing experiments but designing them, interpreting results, and iterating on methodology.
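The keep-or-revert cycle described above can be sketched in a few lines. Everything here is a stand-in — `evaluate` simulates a real benchmark run and `propose_change` simulates a model-generated code diff — but the control flow matches the described loop: propose a modification, test it, keep improvements, revert regressions.

```python
import random

def evaluate(scaffold: str) -> float:
    """Stand-in for a full benchmark run against one scaffold version.
    Seeded per version so keep/revert decisions are deterministic."""
    random.seed(sum(map(ord, scaffold)))
    return random.uniform(0.40, 0.70)

def propose_change(scaffold: str, round_no: int) -> str:
    """Stand-in for the model analyzing its failure trajectories and
    emitting a modified scaffold (in practice, a code diff)."""
    return f"{scaffold}+patch{round_no}"

def self_optimize(scaffold: str, rounds: int = 100):
    best = evaluate(scaffold)
    for r in range(rounds):
        candidate = propose_change(scaffold, r)
        score = evaluate(candidate)
        if score > best:                 # keep the modification
            scaffold, best = candidate, score
        # otherwise revert: scaffold is left unchanged
    return scaffold, best

final, score = self_optimize("scaffold-v0", rounds=100)
```

The structure makes the safety-relevant property visible: the human sets the evaluation metric and the round budget, but every accept/reject decision inside the loop is made without human review.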

The competitive implications are immediate and uncomfortable for the closed-source labs. Claude Opus 4.6 scores 75.7 percent on MLE Bench Lite, the machine learning competition benchmark. M2.7 scores 66.6 percent — the second-highest score among all models and the highest among any open-source model by a substantial margin. On GDPval-AA, which measures general professional competency, M2.7 achieved an ELO score of 1495, the highest among open-source models. The gap between open and closed is not just narrowing — it is narrowing fastest on exactly the tasks that enterprise customers pay premium prices to access. Software engineering, code debugging, machine learning experimentation, complex multi-step reasoning: these are the capabilities that justify Claude’s $5 per million input tokens and GPT-5.4’s $2.50. A model that delivers 75 to 90 percent of that performance at $0.30 per million input tokens does not need to be strictly better. It just needs to be good enough — and 56 percent on SWE-Pro is good enough for a vast number of production workloads.

The efficiency gains from NVIDIA’s deployment optimizations make the cost story even more compelling. On Blackwell Ultra GPUs, M2.7 achieves up to 2.7x throughput improvement through SGLang with expert parallelism, and 2.5x through vLLM with FP8 quantization and custom kernels. Free GPU-accelerated endpoints are already available through NVIDIA NIM. For any company running inference at scale, the arithmetic is straightforward: equivalent capability at a fraction of the cost, deployed on standard NVIDIA infrastructure with production-grade serving frameworks. The premium that closed-source labs charge for frontier performance is being compressed from below by models that are literally free to download.

Here is a quantified comparison that no single benchmark table reveals: combining M2.7’s $0.30 per million input token pricing with its SWE-Pro score of 56.22 percent yields a cost of approximately $0.53 per successfully resolved task. Claude Opus 4.6, at $5.00 per million input tokens with a modestly higher success rate, costs roughly $8.50 per correct solution on comparable tasks. Even accounting for M2.7’s tendency to generate 4x more output tokens per query than average — which erodes its headline per-token savings — the cost-effectiveness ratio still favors the open-source model by 4-to-1 or better for workloads where 56 percent accuracy is sufficient. The closed-source premium is not 17x. But it is meaningful enough to shift procurement decisions for budget-conscious engineering teams.
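The arithmetic behind those figures is simple to reproduce. The sketch below assumes roughly one million input tokens per task — an illustrative assumption, not a published benchmark figure — and backs Claude's success rate out of the quoted ~$8.50.

```python
def cost_per_solved_task(price_per_m_input: float,
                         input_tokens_per_task: int,
                         success_rate: float) -> float:
    """Expected spend per successfully resolved task: the cost of one
    attempt divided by the probability that the attempt succeeds."""
    attempt_cost = price_per_m_input * input_tokens_per_task / 1_000_000
    return attempt_cost / success_rate

# Assumed workload: ~1M input tokens per task (illustrative only).
m27_cost = cost_per_solved_task(0.30, 1_000_000, 0.5622)   # ≈ $0.53
# 0.588 is backed out from the quoted ~$8.50, not a published score.
opus_cost = cost_per_solved_task(5.00, 1_000_000, 0.588)   # ≈ $8.50
```

The same function makes it easy to rerun the comparison with your own token counts and measured pass rates, which is the only version of this calculation that should drive a procurement decision.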

The recursive improvement question nobody wants to answer honestly

The safety implications of M2.7’s self-evolution capability are significant, and the AI safety community has been characteristically slow to grapple with them. Recursive self-improvement — the process by which an AI system modifies itself to become more capable, then uses that increased capability to modify itself further — has been a theoretical concern in AI safety discourse for decades. ICLR 2026 is hosting what may be the world’s first academic workshop dedicated exclusively to recursive self-improvement, scheduled for late April in Rio de Janeiro. The workshop organizers have acknowledged that safety considerations were given minimal emphasis in their initial proposal, a gap that critics quickly noted.

M2.7’s scaffold optimization is not full recursive self-improvement in the theorized sense — the model modified its operational code, not its own weights or architecture. But the distinction is thinner than it appears. A model that can analyze its failures, design better reasoning strategies, and implement code changes to improve its own performance has crossed a threshold that matters regardless of whether the mechanism is weight modification or scaffold rewriting. The functional outcome is the same: the model gets better because it decided to get better, through a process that humans initiated but did not direct step by step. MiniMax’s own framing — calling this “early echoes of self-evolution” — acknowledges that the company views this as the beginning of a trajectory, not its endpoint.

The safety concern is not that M2.7 will spontaneously pursue dangerous goals — the model shows no indication of misaligned optimization targets or deceptive behavior. The concern is about proliferation and precedent, and it is grounded in a straightforward observation about incentives. MiniMax has open-sourced a model that demonstrated autonomous self-improvement capability, meaning any researcher, company, or actor with sufficient compute can replicate, extend, and iterate on this approach. The ICLR workshop’s six-lens framework for analyzing recursive self-improvement — what changes, when, how, where, alignment considerations, and evaluation — provides useful structure, but the academic community is building the governance framework after the model has already been released. Former OpenAI policy head Miles Brundage argued on LessWrong that AI companies have failed to explain what recursively self-improving AI means, why they think it is beneficial, or why the risks are justified. MiniMax’s release renders those questions urgent rather than theoretical.

The counterargument from the open-source community is equally substantive. Closed-source labs are pursuing the same capabilities — OpenAI has announced its intention to build a fully automated AI researcher by March 2028 — but doing so behind closed doors without public scrutiny. Open-sourcing M2.7 enables external researchers to study the self-improvement loop, identify failure modes, and develop safety mitigations that would be impossible to build if the technology remained proprietary. Transparency does not eliminate risk, but it distributes the ability to understand and manage risk across a larger community. The question is whether the safety gains from transparency outweigh the proliferation risks from open access — and reasonable people disagree.

There is also a pragmatic objection to the safety alarm. M2.7’s self-improvement was bounded, supervised, and produced a 30 percent improvement on specific benchmarks — impressive, but not the exponential intelligence explosion that safety researchers have warned about. The model optimized its scaffold code, not its own training process or architecture. It operated within constraints set by MiniMax’s engineering team. The gap between “a model that improved its task-solving code over 100 rounds” and “a model that recursively redesigns itself without bound” is enormous. Treating the former as evidence of the latter conflates a useful engineering technique with an existential risk scenario, and doing so risks crying wolf in ways that desensitize policymakers to genuine threats when they eventually emerge. The state legislatures that just passed 98 chatbot regulation bills are focused on therapy bots and child safety — recursive self-improvement is not on their radar, and overreacting to M2.7 will not put it there.

The new economics of intelligence and what to do about it

MiniMax M2.7 crystallizes a structural shift that has been building since earlier this month, when DeepSeek’s open-source models demonstrated competitive performance running on Huawei chips rather than NVIDIA hardware. The cost of frontier-class AI capability is collapsing faster than most industry participants have priced into their business models. When Google gave away Gemma 4 under Apache 2.0, it was strategic — a move to grow the ecosystem. When MiniMax gives away a self-evolving model that matches GPT-5.3 on coding benchmarks at one-twentieth the cost, it is something more disruptive: proof that the economic moat around closed-source frontier AI is eroding from multiple directions simultaneously.

The implications for Amazon’s $200 billion AI infrastructure bet and the broader $700 billion hyperscaler capex cycle are worth examining. If open-source models continue to close the gap with proprietary systems at the current rate, the premium revenue that cloud providers expect from serving frontier models will compress. AWS charges significantly more for Anthropic’s Claude through Bedrock than for hosting an open-weight model on standard GPU instances. Every percentage point of performance gap that MiniMax, DeepSeek, Qwen, and Gemma close is a percentage point of pricing power that the closed-source labs — and their cloud distribution partners — lose. The $15 billion AWS AI revenue run rate that Jassy touted in his shareholder letter depends on customers choosing proprietary models over open alternatives. M2.7 gives those customers another reason to reconsider.

The self-improvement dimension adds a second layer of disruption. If models can meaningfully improve their own performance through autonomous scaffold optimization, the traditional advantage of large research teams at frontier labs diminishes. OpenAI, Anthropic, and Google employ hundreds of researchers whose primary job is to make their models better. A model that can perform 30 to 50 percent of its own reinforcement learning research workflow is not replacing those researchers — but it is changing the cost curve for organizations that cannot afford hundred-person AI teams. Smaller companies and research groups can now deploy M2.7, point it at their specific domain, and let it optimize its own approach over days or weeks. The democratization of self-improvement — not just model weights but the ability to iterate on those weights autonomously — is a qualitative shift in who can build competitive AI systems.

For practitioners and decision-makers navigating this landscape, the framework is clear:

  • Benchmark your workloads against M2.7 before renewing proprietary API contracts. If your production tasks cluster around code generation, debugging, or multi-step reasoning, M2.7’s 56 percent on SWE-Pro may be sufficient at a fraction of the cost. Run your specific evaluation suite against both options and let the data decide.
  • Invest in scaffold optimization infrastructure. M2.7’s self-improvement loop is replicable: give a capable model access to its own operational code, define evaluation metrics, and let it iterate. The 30 percent performance gain MiniMax achieved is likely the floor, not the ceiling, for domain-specific optimization. Teams that build this infrastructure now will compound the advantage over time.
  • Monitor the ICLR RSI workshop outputs. The late-April workshop in Rio will produce the first concentrated academic analysis of recursive self-improvement risks, governance frameworks, and evaluation benchmarks. The papers and discussions that emerge will shape the policy conversation for the next year.
  • Factor open-source deflation into revenue models. Any business plan that assumes stable pricing for AI inference over the next 18 months is already outdated. M2.7 at $0.30 per million tokens, Gemma 4 at zero, and DeepSeek V4 on non-NVIDIA hardware represent a structural repricing of intelligence that will not reverse. Build your unit economics around a world where frontier-adjacent capability costs a tenth of what it costs today.
  • Take the safety questions seriously without catastrophizing. M2.7’s self-improvement is bounded and useful, not unbounded and existential. Support governance research, contribute to evaluation frameworks, and push for responsible disclosure norms — but do not let theoretical risk scenarios prevent your organization from deploying technology that is already available to your competitors.
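The first recommendation above is straightforward to operationalize. The harness below is a minimal sketch: `solve` is a placeholder for a function that calls a given provider's API and checks the output against your own test suite, and the toy stand-ins exist only so the code runs end to end.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Result:
    solved: int
    total: int
    dollars: float

    @property
    def rate(self) -> float:
        return self.solved / self.total

def benchmark(solve: Callable[[str], tuple], tasks: list) -> Result:
    """Run every task through a provider's `solve` function, which
    returns (passed_our_own_tests, dollars_spent). Scoring against
    your own evaluation suite, not the vendor's benchmark, is the point."""
    solved, dollars = 0, 0.0
    for task in tasks:
        ok, spent = solve(task)
        solved += ok
        dollars += spent
    return Result(solved, len(tasks), dollars)

# Toy stand-ins so the harness runs; real `solve` functions would call
# the respective APIs and evaluate the returned patch or answer.
tasks = [f"task-{i}" for i in range(10)]
open_model = benchmark(lambda t: (int(t.split("-")[1]) % 10 < 6, 0.0005), tasks)
closed_model = benchmark(lambda t: (int(t.split("-")[1]) % 10 < 7, 0.0085), tasks)
```

Comparing `rate` against `dollars` across the two `Result` objects gives exactly the cost-per-solved-task tradeoff discussed earlier, computed on your workload rather than a public leaderboard.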

MiniMax M2.7 is not the model that changes everything. It is the model that makes the change impossible to ignore. A publicly traded Chinese company built an AI system that improved itself over 100 autonomous rounds, matched frontier Western labs on the most commercially valuable benchmarks, and then released the weights to anyone with an internet connection. The era in which frontier AI capability was the exclusive province of a handful of well-funded labs ended quietly on April 12, 2026, at $0.30 per million tokens. Whether the industry — and its regulators — can adapt to that reality will determine who captures the value in the decade ahead.

In other news

Anthropic launches Project Glasswing for defensive cybersecurity — Anthropic unveiled Project Glasswing, a limited partnership program with over 40 companies including Microsoft, Apple, and NVIDIA that provides early access to Claude Mythos Preview — a frontier model that has discovered thousands of high-severity zero-day vulnerabilities across every major operating system and web browser. Anthropic committed $100 million in model usage credits to the initiative, which restricts Mythos to defensive security research only.

Google integrates NotebookLM directly into Gemini app — Google rolled out Notebooks inside the Gemini app, syncing seamlessly with NotebookLM workspaces. The feature creates persistent knowledge bases that carry custom AI instructions and uploaded documents across both products, available now for AI Ultra, Pro, and Plus subscribers on web with mobile expansion coming.

Eclipse Ventures raises $1.3 billion for AI infrastructure and defense — Cerebras backer Eclipse Ventures closed a $1.3 billion fund targeting startups in AI infrastructure, manufacturing, and defense — physical-world applications of AI that venture capital has historically underweighted relative to software. The raise signals growing investor appetite for AI hardware and industrial applications beyond pure model development.

Spirit AI raises $420 million in two rounds within 30 days — Chinese AI startup Spirit AI completed a $420 million fundraise backed by Lei Jun’s Shunwei Capital and Jack Ma’s Yunfeng Fund, including a newly announced $145 million round. The rapid back-to-back raises reflect the intensity of China’s AI funding cycle, where capital is flowing to model companies at a pace that mirrors — and sometimes exceeds — Silicon Valley.