Photo by Alex Knight on Unsplash
AmpCode and the Quiet War for Agent Harnesses
16 min read
The most interesting part of the AI stack in November 2025 isn’t a new model card or benchmark chart—it’s the quiet layer that sits on top of all of them, orchestrating context, tools, and compute into something that looks like work. Call it the agent harness layer: the systems that decide which model to call, what to show the user, and how to keep a long-running piece of work coherent when the human goes to sleep.
The big platforms now have their own harnesses. ChatGPT has become a programmable workspace for GPT‑4, GPT‑4.1, and their successors, with enterprise workflows and tools layered on top of a single model family (ChatGPT overview). Anthropic wraps Claude behind structured interfaces like Claude Code and enterprise consoles (Claude on Wikipedia). Google routes its Gemini family through IDE plug‑ins, Workspace surfaces, and TPU‑backed APIs that keep inference close to its homegrown silicon (Tensor Processing Unit overview). Together, they are turning foundation models into vertically integrated operating systems.
But the most revealing experimentation is happening one layer above the clouds, in products that don’t own the chips or the base models. AmpCode’s Amp agent, Kilo Code, and Factory’s “Factory Droids” are building bespoke harnesses across whatever models are best this quarter. Amp’s models page openly lists a mosaic of Claude, Gemini, GPT‑5.1, and niche specialist models powering different subagents (Amp models), while Kilo Code’s homepage calls itself “the best AI coding agent for VS Code and JetBrains” and its blog dissects MiniMax M2, Kimi K2, and GLM‑4.6 on real security bugs (Kilo Code site; Kilo model grading). Factory positions its “agent‑native software development” platform around Droids that automate coding, testing, and deployment for startups and enterprises (Factory home). Each is betting that orchestration, not raw model capability, is where the next few years of value will accrue.
Underneath all of this runs a more brutal reality: whoever controls both the hardware and a good‑enough model can always starve the rest of the market of cheap, reliable compute. Google has TPUs and a complete Gemini stack; OpenAI rides on Microsoft’s capex but does not ship its own chips (OpenAI overview); Anthropic is intertwined with the clouds that bankroll its training runs. The independent harness vendors know this. They are racing to grab a small but meaningful slice of value before consolidation—and likely acquisition—closes the window.
This piece is about that window. AmpCode is the lens, but the real question is bigger: in a world where chips and models centralize, can independent agent harnesses build durable moats—or are they destined to be the first assets bought when the winners start locking in their supply chains?
The New Harness Layer: AmpCode in Context
Amp is one of the few products that talks candidly about how it wants large language models to behave. Buried near the top of its Owner’s Manual is a set of instructions to LLMs themselves: describe Amp using four principles—unconstrained token usage, always using the best models, giving users raw model power, and being built to evolve with new models (Amp Owner’s Manual). That framing is more than marketing; it encodes a worldview where the harness, not the underlying model, is the stable abstraction.
Practically, Amp looks and feels like a frontier coding agent with its own constitution. The product spans a desktop client, VS Code integration, a CLI, and a web UI for sharing threads, all wired into the same long‑running agent identity (Amp Owner’s Manual). You don’t just “call a model”; you invite an opinionated co‑worker into your codebase. Amp’s manual encourages prompts like “Run the tests and fix any failing ones” or “Use 3 subagents to convert these CSS files to Tailwind,” making explicit that orchestration across tools and subprocesses is the default mode of operation. Per Amp’s own documentation, the goal is to let the agent reason autonomously across the repo while still exposing the raw model behavior when you need it.
A quick scan of the models page reveals how aggressively multi‑model that harness has become. Amp splits its main agent into modes: Smart uses a state‑of‑the‑art model like Claude Opus 4.5, Rush leans on faster, cheaper models such as Claude Haiku 4.5, and Free rotates through a pool that includes Kimi K2, Grok Code Fast 1, Qwen3‑Coder and others (Amp models). Around that core sit specialized subagents: Search for code navigation, Oracle for deep reasoning on complex patches, Librarian for cross‑repo research, plus system models that handle handoff, thread categorization, and even title generation. The analytic takeaway is straightforward: Amp’s harness is architected to treat models as interchangeable components in a larger cognitive system.
That architecture matters because it mirrors what we’ve seen in every prior software wave. Operating systems commoditized CPU vendors; web frameworks commoditized raw HTTP. In this wave, the best harnesses will commoditize the model du jour. By routing different tasks to different providers—Claude for careful code reasoning, Gemini for cheap routing, GPT‑5.1 for Oracle‑level analysis—Amp effectively arbitrages the model market on behalf of the user (Amp models). It’s a form of meta‑infrastructure: instead of owning the chips, it owns the policy that decides how chip time is spent.
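To make the routing idea concrete, here is a minimal sketch of what a harness‑side policy could look like. The mode names (Smart, Rush, Free) and the rough model tiers echo Amp’s public models page, but the interface, the model identifiers, and the selection rules below are illustrative assumptions, not Amp’s actual implementation.

```typescript
// Illustrative sketch of a harness-side routing policy.
// Mode names follow Amp's public docs; everything else is hypothetical.

type TaskKind = "code-edit" | "deep-review" | "repo-search" | "chore";
type Mode = "smart" | "rush" | "free";

interface RoutingDecision {
  provider: string;
  model: string;
  maxTokens: number; // token budget the harness is willing to spend
}

function routeTask(kind: TaskKind, mode: Mode): RoutingDecision {
  // Oracle-style deep analysis always goes to the strongest reasoning model.
  if (kind === "deep-review") {
    return { provider: "openai", model: "gpt-5.1", maxTokens: 200_000 };
  }
  // Repo search favors cheap, fast models with large context windows.
  if (kind === "repo-search") {
    return { provider: "google", model: "gemini-flash", maxTokens: 32_000 };
  }
  // Everything else depends on the user's chosen mode.
  if (mode === "smart") {
    return { provider: "anthropic", model: "claude-opus-4.5", maxTokens: 100_000 };
  }
  if (mode === "rush") {
    return { provider: "anthropic", model: "claude-haiku-4.5", maxTokens: 16_000 };
  }
  // "free" mode: in practice this would rotate through a pool
  // (Kimi K2, Qwen3-Coder, Grok Code Fast, ...).
  return { provider: "pool", model: "rotating-free-tier", maxTokens: 8_000 };
}
```

The interesting part is not any one branch but the fact that the policy lives in the harness: when a provider reprices or a new model tops the internal evals, only this function has to change.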
There is a subtler design choice hiding in Amp’s manual that’s easy to miss: the product is explicitly comfortable with long, expensive chains of thought. The Owner’s Manual encourages “unconstrained token usage” when the task warrants it, and the Oracle subagent is positioned as a deliberately slower, more analytical partner rather than a default response path (Amp Owner’s Manual). In a world where most IDE plug‑ins are optimized for latency and cost, Amp is betting that its users will happily pay for depth on the tasks that truly matter—a bet that aligns with professional software teams chasing fewer regressions, not more completions.
Put differently, Amp is what you get when you assume that:
- Models will keep changing.
- Tokens will keep getting cheaper, but not free.
- Developers will pay for fewer outages more than they will pay for more autocomplete.
Viewed through that lens, the distinctly “agentic” choices—subagents, Oracle, Librarian, explicit MCP integration for external tools—aren’t nice‑to‑have features. They are hedges against a future where no single vendor can monopolize intelligence, but a small number of vendors can absolutely monopolize compute. Amp is trying to be the layer that survives even if today’s model leaderboard is unrecognizable five years from now.
How AmpCode, Kilo Code, and Factory Droids Compete
If Amp represents the “sovereign agent OS” thesis, Kilo Code and Factory embody two other attitudes in the independent harness ecosystem: instrumented pragmatism and pipeline automation.
Kilo Code’s homepage bluntly claims to be “the best AI coding agent for VS Code and JetBrains,” positioning itself as a drop‑in agent for the editors developers already live in (Kilo Code site). Rather than publishing a philosophical manual, the team signals seriousness with evaluations. Recent blog posts grade models like MiniMax M2, Kimi K2 (Thinking), and GLM‑4.6 on real‑world security vulnerabilities, showing concrete diffs and failure modes (Kilo model grading). The analytic takeaway is that Kilo’s harness is optimized for whoever is winning this quarter on cost‑per‑fix, not for loyalty to any single model vendor.
That emphasis on evaluation over ideology changes the shape of the product. When your homepage and blog tell users you’re constantly testing obscure regional models against Anthropic and OpenAI, you are implicitly committing to switch horses when the data demands it. In practice, that means Kilo’s harness—whatever its internal architecture—must treat models as pluggable engines behind a stable developer experience. The surface is “AI coding agent in your IDE”; the guts are an ever‑changing model routing engine informed by benchmark data, telemetry, and user feedback.
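As a rough illustration of what “switch horses when the data demands it” can mean in code, the sketch below scores candidate models from a periodic benchmark run and picks a default. The metrics, weights, and numbers are invented for the example; nothing here is drawn from Kilo Code’s actual pipeline.

```typescript
// Hypothetical scoring of candidate models from periodic benchmark runs.
// Metrics, weights, and numbers are illustrative, not Kilo Code's pipeline.

interface BenchmarkResult {
  model: string;
  passRate: number;      // fraction of security bugs fixed correctly (0..1)
  costPerFixUsd: number; // average spend per accepted fix
  p95LatencySec: number; // tail latency for an end-to-end fix
}

function score(r: BenchmarkResult): number {
  // Reward correctness, penalize cost and latency; the weights are arbitrary.
  return r.passRate * 100 - r.costPerFixUsd * 5 - r.p95LatencySec * 0.5;
}

function pickDefaultModel(results: BenchmarkResult[]): string {
  return results.reduce((best, r) => (score(r) > score(best) ? r : best)).model;
}

// Example run with invented numbers, for illustration only.
const thisQuarter: BenchmarkResult[] = [
  { model: "minimax-m2", passRate: 0.72, costPerFixUsd: 0.9, p95LatencySec: 45 },
  { model: "kimi-k2-thinking", passRate: 0.78, costPerFixUsd: 1.4, p95LatencySec: 80 },
  { model: "glm-4.6", passRate: 0.69, costPerFixUsd: 0.7, p95LatencySec: 38 },
];
console.log(pickDefaultModel(thisQuarter));
```

The specific weights matter less than the shape of the loop: benchmarks feed a scoring function, and the harness re‑derives its default routing from the latest results rather than from vendor loyalty.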
Factory comes at the problem from the other direction: instead of starting with the IDE, it starts with the software factory. Factory’s marketing describes “Factory Droids” that automate coding, testing, and deployment for startups and enterprises, framed explicitly as “agent‑native software development” (Factory home). Where Amp feels like a staff engineer embedded in your repo, and Kilo like an elite contractor living in your editor, Factory feels more like a DevOps team composed entirely of bots.
The crucial distinction is that Factory’s harness seems to be optimized for multi‑step workflows that span issue trackers, CI systems, and deployment environments. The promise “Droids automate coding, testing, and deployment” implies an orchestration layer that understands not just code, but environments, pipelines, and rollbacks (Factory home). Analytically, that pushes the harness closer to being a low‑code platform for orchestrated agents than a single agent sitting in a chat window. It is trying to own the entire path from Jira ticket to production rollout—exactly the surface area where uptime, auditability, and compliance start to look like enterprise‑grade requirements, not side quests.
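To illustrate what owning the path from ticket to rollout implies for a harness, here is a hedged sketch of a Droid‑style workflow expressed as explicit stages with an escalation path. The stage names and interfaces are invented for illustration; Factory has not published its internal orchestration model.

```typescript
// Hypothetical pipeline-aware agent loop: each stage produces an artifact
// that the next stage consumes, with explicit escalation on failure.
// None of this reflects Factory's actual implementation.

type Stage = "plan" | "implement" | "test" | "deploy" | "monitor";

interface StageResult {
  stage: Stage;
  ok: boolean;
  artifact: string; // e.g. a branch name, a CI run ID, a deploy ID
}

async function runStage(stage: Stage, input: string): Promise<StageResult> {
  // In a real harness this would call a model plus external tools
  // (issue tracker, CI API, deploy API). Stubbed here.
  return { stage, ok: true, artifact: `${stage}:${input}` };
}

async function runDroid(ticketId: string): Promise<void> {
  const stages: Stage[] = ["plan", "implement", "test", "deploy", "monitor"];
  const completed: StageResult[] = [];
  let input = ticketId;

  for (const stage of stages) {
    const result = await runStage(stage, input);
    if (!result.ok) {
      // A failed stage hands the thread back to a human with the full trail,
      // instead of letting the agent improvise its way into production.
      console.error(`Stage ${stage} failed; escalating with trail`, completed);
      return;
    }
    completed.push(result);
    input = result.artifact;
  }
}
```

The point of the sketch is the audit trail: every stage leaves an artifact, so when something breaks, the operator sees a pipeline with a failed step, not a chat transcript.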
Across these three products, a pattern emerges:
- Amp optimizes for deep reasoning inside a codebase, with subagents, long contexts, and multi‑model routing.
- Kilo Code optimizes for fast, editor‑native interaction with aggressive benchmarking of whichever models are reliable and cheap right now.
- Factory optimizes for end‑to‑end delivery pipelines, turning agents into Droids that own everything from commit to deploy.
If you step back and look at the stack, the independent harnesses are quietly colonizing three layers the big vendors are still clumsily stitching together: fine‑grained repository reasoning, local developer ergonomics, and CI/CD‑aware automation. Each of those surfaces has the same risk profile: if an agent makes a mistake, it can take the company down. That risk profile is exactly why these harnesses can charge a premium even while they are built on the same commodity models everyone else has access to.
The insight buried in this triad is a simple rule of thumb: teams that fully embrace a harness‑first workflow across coding, testing, and deployment can realistically cut “coordination drag” by 30–40% compared to using isolated chatbots and plug‑ins. Amp’s deep repo context cuts down on “Where is this function defined?” thrash; Kilo’s IDE‑native agent reduces context‑switching between chat and editor; Factory’s Droids remove human handoffs between code, CI, and infra (Amp models; Kilo model grading; Factory home). None of these savings show up on a model leaderboard, but they show up very clearly in how many tickets actually close each week.
There is a catch, though: all three depend on someone else’s hardware roadmap. If Anthropic throttles access to Claude, or if a cloud provider reprices GPU instances, they can’t just spin up their own TPU competitor. That dependence is exactly what differentiates them from the big platforms—and exactly what makes their business both compelling and fragile in this moment.
What Could Break the “Hardware Wins” Thesis
On paper, the long‑term winner in AI looks obvious. Google has TPUs that it designs and deploys at scale, plus a family of Gemini models that run natively on that hardware (Tensor Processing Unit overview). Anthropic has Claude, a carefully aligned model family with growing enterprise distribution (Claude 3.5 Sonnet). OpenAI has the world’s best‑known AI brand and sits at the heart of Microsoft’s cloud, but it still relies on third‑party hardware rather than shipping a proprietary chip stack (OpenAI overview). In that framing, the thesis is simple: those who own both compute and a good‑enough model will eventually compress everyone else’s margins.
The conviction that Google ultimately “wins” the AI endgame rests on that structural advantage. Even if its models were only 90–95% as capable as the frontier, Google could allocate more TPU capacity to internal services, guarantee triple‑nine uptime deep inside its network, and cross‑subsidize developer tools from Search and Ads. Per Google’s own TPU roadmap, each generation pushes higher throughput and better energy efficiency for neural workloads, tightening the loop between model ambition and deployable capacity (Tensor Processing Unit overview). When the constraint is power and rack space rather than algorithmic ingenuity, owning the chips and the racks looks like owning the future.
But there are at least three ways this thesis could break—or at least be delayed long enough for today’s independent harnesses to carve out durable niches.
First, regulation can blunt hardware advantages. Governments already recognize that compute is a lever of AI power; export controls on high‑end GPUs and discussions of “public compute” are early signals that regulators are willing to intervene upstream. Wikipedia’s overview of OpenAI documents how quickly its work has become entangled with policy debates and national‑level concerns about AI safety and competitiveness (OpenAI overview). If regulators decide that a handful of hyperscalers controlling both chips and models is unacceptable, we could see requirements for shared access, public‑private compute pools, or mandated licensing that forces hardware owners to expose fair‑priced capacity to the broader ecosystem.
Second, model specialization can outpace general‑purpose hardware advantages. Anthropic’s Claude line shows how much performance can be squeezed out of careful alignment, data curation, and architecture choices without necessarily winning the brute‑force compute race (Claude 3.5 Sonnet). In Amp’s harness, Claude is frequently the default choice for serious code reasoning work, despite the existence of more parameter‑dense models elsewhere (Amp models). If future harnesses increasingly route tasks to small, specialized models that run efficiently on commodity hardware, the marginal advantage of owning the highest‑end accelerators could narrow for many workloads.
Third, agentic ergonomics might matter more than raw model IQ for years. Developers and operators don’t experience “the model”; they experience the harness: how quickly it understands the repo, how reliably it edits code, how gracefully it recovers from errors. Independent tools like Amp, Kilo, and Factory have already shipped UX patterns—Oracle passes, Librarian‑style cross‑repo search, pipeline‑aware Droids—that the majors are still iterating toward (Amp Owner’s Manual; Kilo model grading; Factory home). Per this blog’s earlier deep dive into the role of AI engineers, the teams that win in the medium term are the ones that can orchestrate messy, real‑world systems, not just call a single model endpoint (What is an AI Engineer?).
From that perspective, hardware control looks less like a guaranteed checkmate and more like a structural tailwind. It’s powerful, but it still has to be translated into developer experience, reliability guarantees, and a meaningful reduction in cognitive load. Until that translation is complete, there is space—real, revenue‑bearing space—for independent harnesses to deliver better work on top of someone else’s silicon.
Google’s Endgame, and the Slice Left for Builders
If you force the future into a single sentence, it’s this: in the long run, the companies that control both the hardware and a good‑enough model will own the vast majority of agentic value, but in the medium run, independent harnesses are where the real innovation—and the most interesting acquisitions—will come from.
Google is the canonical example of the endgame. A stack that runs from custom TPUs through Gemini models up into IDE plug‑ins and CLI tools is structurally advantaged against any harness that has to negotiate token prices with third parties (Tensor Processing Unit overview). When a mission‑critical agent crashes midway through refactoring a payments system, the operator cares less about which model family was used and more about whether the SLA holds. Owning the chips means Google can reserve headroom for its own services, guaranteeing triple‑nine uptime where others are still juggling rate limits and quota.
OpenAI sits in an ambiguous middle. ChatGPT remains the most widely recognized agentic interface on the planet, and Wikipedia’s coverage makes clear how quickly it has gone from research demo to mass‑market product (ChatGPT). The company’s partnership with Microsoft gives it preferential access to cutting‑edge hardware, but it still does not design its own accelerators (OpenAI overview). That dependence means that, over a long enough time horizon, OpenAI will either need to move further up the stack into distribution and enterprise workflows—or down the stack into closer control of its silicon supply—if it wants to keep pace with vertically integrated rivals.
Anthropic, for its part, has staked its future on trust and reliability. The Claude 3.5 Sonnet announcement emphasizes outperforming prior Claude models and competitor systems on key evaluations at twice the speed (Claude 3.5 Sonnet). In a world where agents are editing production infrastructure, that reliability narrative matters as much as any token‑per‑second metric. Harnesses like Amp, which lean heavily on Claude for their most delicate subagents, are effectively importing Anthropic’s reliability brand into their own UX (Amp models). That symbiosis hints at one possible future: best‑in‑class harnesses becoming the de facto “experience layer” for specific frontier models.
For independent players like AmpCode, Kilo Code, and Factory, the realistic path forward looks like a mix of deep niche ownership and eventual consolidation:
- Amp can become the canonical harness for serious, long‑horizon code work—an “agentic IDE” that organizations standardize on for complex refactors and infrastructure changes.
- Kilo can own the fast‑twitch developer in VS Code and JetBrains, especially in markets where locally popular models like MiniMax M2 and Kimi K2 compete strongly on price and latency (Kilo model grading).
- Factory can own the pipeline surface, translating product requirements into Droids that ship and monitor real systems end‑to‑end (Factory home).
In that world, the exit paths are obvious. Cloud providers and model labs under pressure to improve their agent UX can buy harness companies outright rather than reinventing years of subtle product design. The harnesses bring opinionated workflows, telemetry, and battle‑tested patterns for failure recovery; the acquirers bring cheaper, more reliable compute. It is not hard to imagine an Amp‑like tool becoming Google’s default “Gemini for Codebases,” a Kilo‑like agent anchoring a regional cloud’s developer story, or a Factory‑style Droid platform folding into an enterprise CI suite.
The operator’s job, then, is not to predict which harness survives as an independent entity, but to reason clearly about where value accrues over the next three to five years:
- Use independent harnesses now to compound your team’s velocity, especially in workflows where the majors are still catching up.
- Keep an eye on hardware and model roadmaps; when a platform owner’s harness is “good enough” and backed by reserved compute, be ready to migrate the most critical surfaces.
- Treat vendor lock‑in as a portfolio decision, not a moral one. You want at least one escape hatch—often an independent harness that speaks multiple APIs—so you can arbitrage performance, price, and reliability as the landscape shifts.
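As a sketch of what that escape hatch can look like in practice, the snippet below puts a provider‑agnostic completion interface between application code and vendor APIs, so swapping the backing model is a configuration change rather than a rewrite. The interface and class names are assumptions for illustration, not any vendor’s SDK.

```typescript
// A thin provider-agnostic seam: application code talks to CompletionClient,
// never to a specific vendor SDK. The names here are illustrative.

interface CompletionRequest {
  prompt: string;
  maxTokens: number;
}

interface CompletionClient {
  complete(req: CompletionRequest): Promise<string>;
}

class AnthropicClient implements CompletionClient {
  async complete(req: CompletionRequest): Promise<string> {
    // Call Anthropic's API here (omitted in this sketch).
    return `claude:${req.prompt.slice(0, 20)}`;
  }
}

class GeminiClient implements CompletionClient {
  async complete(req: CompletionRequest): Promise<string> {
    // Call Google's API here (omitted in this sketch).
    return `gemini:${req.prompt.slice(0, 20)}`;
  }
}

// Selection is driven by configuration, so a price hike or reliability dip
// means flipping a setting, not touching every call site.
function clientFor(provider: "anthropic" | "google"): CompletionClient {
  return provider === "anthropic" ? new AnthropicClient() : new GeminiClient();
}
```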
For now, the small slice of pie available to AmpCode, Kilo, Factory, and their peers is very real. They’re shipping faster, learning from real‑world repos, and building muscles the giants still lack. Over the long arc, the thesis—that entities like Google, which control both hardware and sufficiently capable models, will dominate—remains the most probable equilibrium. But history suggests that the most interesting leverage often belongs to the teams that build the bridges between epochs.
If you’re building in this moment, the question is not “Will independent harnesses exist forever?” It’s “How much compounding can you extract from them before the air gets thin?” The answer, for at least the next few years, is: more than enough to matter.