Cursor's Agent Swarm Built a Browser in Six Days
Cursor’s engineering team decided to attempt something audacious in early January 2026: unleash hundreds of AI agents to build a fully functional web browser from scratch. Not a wrapper around Chromium or a skin over WebKit, but an actual rendering engine with HTML parsing, CSS layout, JavaScript execution, and networking—built entirely by autonomous AI systems coordinating with each other. The project took six days. The first two were a disaster. The final four produced a working browser that could render most websites correctly, complete with tab management, history, and bookmarks. What the team learned along the way reshapes how we should think about multi-agent systems, coordination patterns, and—most surprisingly—why the prompts you write matter more than which model you choose or how sophisticated your orchestration framework becomes.
The experiment began as a stress test for Cursor’s long-running autonomous coding capabilities, which the company had been refining throughout 2025. Building a browser represented the ultimate challenge: browsers rank among the most complex software artifacts ever created, with millions of lines of code in production implementations, decades of accumulated specifications, and edge cases that would fill libraries. Google employs thousands of engineers on Chrome. Mozilla has maintained Firefox for over two decades. The idea that AI agents could produce anything comparable seemed like hubris—which is precisely why Cursor chose it. If agents could make meaningful progress on a browser, they could probably handle most software engineering tasks.
The experiment also served as a proving ground for multi-agent coordination patterns that had been theorized but rarely tested at scale. OpenAI’s recent research on chain-of-thought monitorability had explored how to supervise AI systems making complex decisions—directly relevant to coordinating hundreds of autonomous agents. Anthropic’s documentation on Agent Skills had formalized patterns for extending agent capabilities through modular, composable instructions. But these were largely conceptual or focused on single-agent scenarios. Cursor wanted to see what actually worked when you scaled from a few agents to hundreds, working on a codebase that would grow to hundreds of thousands of lines.
The committee that couldn’t ship
The first architecture seemed reasonable on paper. Cursor deployed roughly 200 agents, all running GPT-5.2, with equal status and authority. Agents would coordinate through a shared planning document—a kind of distributed Kanban board where any agent could claim tasks, report progress, and flag blockers. Lock mechanisms prevented two agents from editing the same file simultaneously. The expectation was emergent coordination: agents would naturally partition work, avoid conflicts, and converge on a coherent codebase.
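To make the failure mode concrete, here is a minimal sketch of what that shared-board, lock-based coordination might look like. The class names, fields, and locking behavior are illustrative assumptions for the example, not Cursor's actual implementation.

```python
import threading
from dataclasses import dataclass

@dataclass
class Task:
    description: str
    claimed_by: str | None = None   # agent id, None while unclaimed
    status: str = "open"            # open -> in_progress -> done / blocked

class SharedBoard:
    """Illustrative flat-hierarchy coordination: any peer can claim tasks
    and lock files; no agent has authority over the others."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.tasks: list[Task] = []
        self.file_locks: dict[str, str] = {}   # path -> agent id holding it

    def claim_task(self, agent_id: str) -> Task | None:
        with self._lock:
            for task in self.tasks:
                if task.claimed_by is None:
                    task.claimed_by = agent_id
                    task.status = "in_progress"
                    return task
        return None

    def acquire_file(self, agent_id: str, path: str) -> bool:
        with self._lock:
            if self.file_locks.get(path, agent_id) != agent_id:
                return False   # another peer holds it; caller must wait or pivot
            self.file_locks[path] = agent_id
            return True

    def release_file(self, agent_id: str, path: str) -> None:
        with self._lock:
            if self.file_locks.get(path) == agent_id:
                del self.file_locks[path]
```

With roughly 200 peers polling one board and competing for per-file locks, the board and the lock table become exactly the bottlenecks described next.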
What actually happened resembled a poorly run corporate committee. Agents became obsessively cautious. When one agent needed to modify a file, it would acquire a lock—then hold that lock for extended periods while it deliberated on whether its changes might conflict with work happening elsewhere. Other agents, seeing locks on files they needed, would either wait indefinitely or pivot to less important tasks. The planning document became a bottleneck as agents spent more cycles updating their status than writing code.
Worse, the agents developed what Cursor’s engineers described as “diffusion of responsibility.” When a bug appeared in the rendering engine, no single agent owned the problem. Multiple agents would investigate the same issue, reach similar conclusions, and then hesitate to implement fixes because they weren’t sure if another agent was already working on it. The same defensive behavior that makes humans ineffective in committee settings emerged spontaneously in AI agents operating as peers.
The flat hierarchy also produced architectural incoherence. Different agents made different assumptions about data structures, API boundaries, and error handling conventions. The networking agent expected one interface for HTTP responses; the rendering agent expected another. These mismatches accumulated into a codebase that looked like it had been written by a hundred different people who never spoke to each other—which, in a sense, it had been.
By day two, the team had a browser that could fetch HTML over the network and display text on screen. But the code was unmaintainable, the agents were deadlocked in coordination overhead, and progress had ground to a halt. The experiment appeared headed for failure.
Hierarchy emerges as the solution
The second attempt restructured everything around clear role separation. Instead of 200 peer agents, Cursor established a hierarchy with three tiers: Planners, Workers, and a single Judge Agent. The Planners (roughly 10 agents) were responsible for architecture, task decomposition, and coordination. Workers (roughly 150 agents) executed specific implementation tasks assigned by Planners. The Judge Agent reviewed all code before it could be merged, checking for consistency, correctness, and adherence to architectural decisions.
This pattern mirrors what Anthropic calls the “orchestrator-workers workflow”—a central LLM dynamically breaks down tasks, delegates them to worker LLMs, and synthesizes results. The key insight is that the orchestrator maintains global context while workers operate in focused, scoped environments. Workers don’t need to understand the whole system; they need to understand their task and execute it well. Anthropic’s Agent Skills framework explicitly recommends modular, composable patterns for complex tasks where subtasks cannot be predicted in advance—precisely the situation with building a browser.
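As a rough illustration of that shape, the sketch below assumes a generic `call_llm(system, user)` helper (a concrete version appears later) rather than any particular SDK; the prompts, function names, and serial execution are simplifications for the example, not Cursor's system.

```python
def call_llm(system: str, user: str) -> str:
    """Placeholder for a model call; swap in whichever API you use."""
    raise NotImplementedError

def planner_decompose(goal: str) -> list[str]:
    # The planner keeps global context and emits scoped, file-level task specs.
    plan = call_llm(
        system="You are a planner. Break the goal into independent, "
               "file-scoped tasks. One task per line.",
        user=goal,
    )
    return [line.strip() for line in plan.splitlines() if line.strip()]

def worker_execute(task_spec: str, architecture_summary: str) -> str:
    # Workers see only their task plus a summary of the architecture.
    return call_llm(
        system="You are a worker. Implement exactly the task described. "
               "Do not add features beyond the specification.",
        user=f"<architecture>{architecture_summary}</architecture>\n"
             f"<task>{task_spec}</task>",
    )

def build(goal: str, architecture_summary: str) -> list[str]:
    tasks = planner_decompose(goal)
    # In the real system ~150 workers ran in parallel; serial here for clarity.
    return [worker_execute(t, architecture_summary) for t in tasks]
```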
The Judge Agent represented an innovation beyond standard orchestrator-workers. Drawing on the “evaluator-optimizer” pattern, the Judge didn’t just approve or reject code; it provided feedback that workers could use to improve their submissions. A worker might submit a CSS parsing function, receive feedback that it didn’t handle malformed selectors gracefully, and revise the implementation before final approval. This created an iterative refinement loop within the larger system.
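A hedged sketch of that refinement loop, reusing the illustrative helpers from the previous example; the verdict format and the retry limit are assumptions made for the sake of the example, not documented details of Cursor's Judge.

```python
def judge_review(submission: str, task_spec: str) -> tuple[bool, str]:
    """Return (approved, feedback): the judge checks the submission against
    the task spec and the project's architectural conventions."""
    verdict = call_llm(
        system="You are a code reviewer. Reply 'APPROVE' or 'REJECT: <reason>'.",
        user=f"<task>{task_spec}</task>\n<submission>{submission}</submission>",
    )
    return verdict.startswith("APPROVE"), verdict

def worker_with_review(task_spec: str, architecture_summary: str,
                       max_rounds: int = 3) -> str:
    feedback = ""
    for _ in range(max_rounds):
        spec = task_spec if not feedback else (
            f"{task_spec}\n\nReviewer feedback to address:\n{feedback}"
        )
        submission = worker_execute(spec, architecture_summary)
        approved, feedback = judge_review(submission, task_spec)
        if approved:
            return submission
    return submission  # in a real system, escalate to a planner or a human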
Role separation solved the lock contention problem immediately. Workers no longer competed for files because Planners explicitly assigned file ownership. The diffusion of responsibility disappeared because every task had a clear owner. Architectural coherence improved because Planners maintained a shared understanding of system design and propagated that understanding through task specifications.
The hierarchical system also enabled specialization. Some workers became de facto experts in specific subsystems—the layout engine, the JavaScript interpreter, the networking stack—because they handled repeated tasks in those areas. This emergent specialization mirrors how human engineering teams naturally develop expertise concentrations, but it happened organically through task assignment patterns rather than explicit role definitions.
Progress accelerated dramatically. By day four, the browser could render HTML and CSS correctly for simple pages. By day five, JavaScript execution worked for basic scripts. By day six, the team had a browser that handled most of the web, with remaining gaps in advanced CSS features and newer JavaScript APIs. The codebase was coherent enough that human engineers could understand and modify it.
The experiment revealed a fundamental truth about multi-agent systems: equality is not optimal. Agents operating as peers create coordination overhead that scales poorly. Hierarchy—with clear authority, responsibility, and information flow—enables the parallelism that makes multi-agent systems valuable while avoiding the deadlocks and inconsistencies that destroy them.
Prompts won, models and harnesses came second
Throughout both phases of the experiment, Cursor tested multiple frontier models: GPT-5.2, Claude Opus 4.5, and Gemini 3 Pro. They also varied the orchestration harness and—most consequentially—the prompts given to agents. The performance differences across these three dimensions were notable and unexpected, and they inverted conventional assumptions about what matters most in agentic AI.
GPT-5.2 consistently outperformed other models for the long autonomous work that browser construction required. The advantage wasn’t raw intelligence or coding ability; on individual tasks, the models performed comparably. The difference was behavioral: GPT-5.2 followed instructions more reliably and avoided drift over extended trajectories. When given a task specification, it executed that specification without adding unrequested features, making unauthorized architectural changes, or gradually wandering from the assigned work.
Claude Opus 4.5, despite scoring higher on some benchmarks, exhibited what engineers called “creative interpretation”—a tendency to improve upon or extend task specifications in ways that seemed helpful but introduced inconsistencies at scale. An Opus worker asked to implement an HTTP client might decide to add caching functionality, unaware that a Planner had assigned caching to a different agent. These well-intentioned additions created merge conflicts and architectural contradictions.
Gemini 3 Pro showed strong performance but occasionally exhibited context degradation on very long tasks, producing outputs that seemed to forget earlier specifications. This proved particularly problematic for workers maintaining large files or implementing features that spanned multiple sessions.
The GPT-5.2 advantage aligns with what Cursor had observed in their dynamic context discovery research: as models improve as agents, success increasingly depends on reliability and consistency rather than raw capability. A model that executes instructions perfectly is more valuable in an agentic context than a model that executes instructions brilliantly but unpredictably. OpenAI’s emphasis on instruction-following in the GPT-5 series appears to pay dividends specifically in autonomous settings where human oversight is minimal.
But model selection, while important, proved secondary to prompt quality. This was perhaps the most significant finding from Cursor's experiment: prompts mattered more than either the agent harness or the underlying model. The team systematically varied three factors: the model (GPT-5.2, Claude Opus 4.5, Gemini 3 Pro), the harness (orchestration framework, tool configurations, memory systems), and the prompts (system messages, task specifications, few-shot examples). Of these three levers, prompts dominated so thoroughly that the team initially suspected measurement error.
Well-crafted prompts running on GPT-5.2 with a basic harness outperformed poorly crafted prompts on any model with sophisticated orchestration. Conversely, sophisticated harnesses couldn't rescue vague or contradictory prompts. The prompts were the specification; everything else was execution. When the team quantified the impact, they found that prompt quality explained roughly 60% of the variance in agent performance, model selection about 25%, and harness sophistication the remaining 15%.
This finding deserves unpacking because it contradicts a prevailing assumption in the AI engineering community. Significant effort has gone into building elaborate agent frameworks—LangChain, AutoGPT, CrewAI, and dozens of others—that promise to handle the complexity of agent orchestration. These frameworks provide abstractions for memory management, tool use, planning, and coordination. The implicit assumption is that the framework does the heavy lifting while prompts fill in application-specific details. Venture capital has flowed to framework startups on this premise. Engineering teams evaluate agent capabilities by the sophistication of their orchestration systems.
Cursor’s experiment suggests the opposite hierarchy. As OpenAI’s research on chain-of-thought monitorability recently demonstrated, what matters most is the reasoning process itself—not the infrastructure around it. The same principle applies to agent frameworks: they primarily manage boilerplate like calling APIs, parsing responses, maintaining conversation history, and handling tool invocations. These are necessary but not differentiating. The actual intelligence of the system comes from what you put in the prompts. The framework is plumbing; the prompt is architecture.
Anthropic’s guidance on agent development reinforces this point: “We suggest that developers start by using LLM APIs directly: many patterns can be implemented in a few lines of code. If you do use a framework, ensure you understand the underlying code.” The company found that the most successful agent implementations in production use “simple, composable patterns rather than complex frameworks.” The value lies in prompt engineering, not in framework sophistication. Teams that invest in elaborate orchestration often find themselves debugging framework abstractions rather than improving agent behavior.
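In that spirit, the `call_llm` placeholder used in the sketches above can be a few lines against a provider SDK rather than a framework. The version below uses the OpenAI Python client; the model name simply follows the article's setting and is an assumption, so substitute whatever you actually run.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def call_llm(system: str, user: str) -> str:
    """Direct API call, no framework layer. The model name is taken from the
    article's setting and is an assumption; replace it with your own model."""
    response = client.chat.completions.create(
        model="gpt-5.2",
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
    )
    return response.choices[0].message.content
```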
Consider what the harness actually does in a well-designed agent system. It manages context windows, ensuring agents receive relevant information without exceeding token limits. It provides tool interfaces, translating between model outputs and external APIs. It handles memory, persisting information across conversation turns. It orchestrates multi-step workflows, chaining model calls together. All of these functions are important. None of them determine whether the agent accomplishes its task effectively. A harness can’t make a model understand a problem it wasn’t prompted to understand. A harness can’t prevent a model from making mistakes that the prompt didn’t anticipate.
The model provides raw capabilities—reasoning ability, coding knowledge, language understanding. Better models enable better outcomes, all else equal. But “all else equal” never holds in practice. A more capable model given ambiguous instructions will generate ambiguous results with more fluency and confidence. The capabilities must be directed, and prompts do the directing. Cursor found that upgrading from GPT-5.2 to a hypothetical GPT-6 would matter less than improving prompts by 20%—a finding that should give pause to organizations that reflexively upgrade models while leaving prompts unchanged.
What made Cursor’s prompts effective? Several principles emerged from the experiment, each rooted in reducing ambiguity and increasing specificity.
First, prompts specified not just what to do but what not to do. The most effective worker prompts included explicit boundaries: “Implement only the feature described. Do not add optimizations, extensions, or improvements beyond the specification. If you identify potential enhancements, document them but do not implement them.” These negative instructions prevented the drift that plagued some models. Without them, agents interpreted their task charitably, adding features they thought would be helpful—features that created conflicts with other agents’ work.
Second, prompts established clear success criteria. Rather than asking agents to “implement a CSS parser,” effective prompts specified: “Implement a CSS parser that handles selectors, properties, and values according to CSS 2.1 specification. The parser must reject malformed CSS with informative error messages. Include unit tests that verify parsing of these ten example stylesheets.” The concreteness enabled agents to know when they were done and evaluate their own work. Vague prompts produced vague outcomes; precise prompts produced precise outcomes.
Third, prompts used structured formats with XML tags and clear sections—a practice recommended in both Anthropic’s and OpenAI’s prompt engineering documentation. Planners specified tasks using consistent templates with sections for objective, constraints, inputs, expected outputs, and success criteria. This structure made it easier for workers to parse specifications and for the Judge to evaluate compliance. Unstructured prose prompts led to unstructured outputs; agents seemed to mirror the organization of their instructions.
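For illustration only (Cursor's actual templates were not published), a planner template along these lines folds the negative instructions and success criteria from the earlier principles into consistent XML-tagged sections:

```python
# Illustrative planner task template: consistent XML-tagged sections make
# specs easier for workers to parse and for the Judge to check against.
TASK_TEMPLATE = """\
<task>
  <objective>{objective}</objective>
  <constraints>
    Implement only the feature described. Do not add optimizations,
    extensions, or improvements beyond the specification. If you identify
    potential enhancements, document them but do not implement them.
  </constraints>
  <inputs>{inputs}</inputs>
  <expected_outputs>{expected_outputs}</expected_outputs>
  <success_criteria>{success_criteria}</success_criteria>
</task>
"""

# Filling the template for the CSS-parser task mentioned above.
css_parser_task = TASK_TEMPLATE.format(
    objective="Implement a CSS parser for selectors, properties, and values "
              "per the CSS 2.1 specification.",
    inputs="Raw stylesheet text as a string.",
    expected_outputs="A list of rule objects; malformed CSS is rejected with "
                     "an informative error message.",
    success_criteria="Unit tests pass for the ten example stylesheets "
                     "provided with the task.",
)
```

Filled out this way, the worker and the Judge evaluate against the same checklist, which is what made compliance checking tractable.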
Fourth, prompts included few-shot examples. A worker implementing networking features received examples of well-structured code from other parts of the codebase, demonstrating expected conventions for error handling, logging, and API design. These examples anchored the worker’s style to the project’s norms, reducing inconsistency without requiring explicit style guides. Few-shot learning proved more effective than detailed verbal descriptions of coding conventions.
Fifth, prompts maintained context about the broader system. Workers received not just their immediate task but a summary of the overall architecture, their component’s role within it, and the interfaces they needed to respect. This context enabled local decisions that remained globally coherent. Agents without architectural context made locally reasonable choices that created system-level problems.
Sixth, prompts evolved throughout the project. The team discovered that initial prompts required refinement based on observed failure modes. When agents consistently made certain mistakes, the prompts were updated to address those specific issues. This iterative prompt engineering—writing, observing, refining—proved essential. First-draft prompts never worked optimally, no matter how carefully considered.
The harness mattered, but primarily for enabling prompt effectiveness. Cursor’s harness provided context management—giving agents access to relevant files without overwhelming their context windows—and tool integration. The dynamic context discovery techniques Cursor had developed proved valuable: rather than loading all potentially relevant information upfront, the harness allowed agents to pull context on demand. This reduced noise and improved focus. But the harness was thin, almost invisible to the agents themselves. The heavy lifting happened in the prompts.
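A minimal sketch of what "pull context on demand" can look like in practice; these tool functions are illustrative assumptions (and the search helper assumes a Unix grep is available), not Cursor's harness.

```python
import subprocess
from pathlib import Path

# Illustrative on-demand context tools: instead of loading the whole repo into
# the prompt upfront, the agent calls these when it decides it needs more.

def read_file(path: str, max_chars: int = 8_000) -> str:
    """Return the start of a file; the agent can ask for more if needed."""
    return Path(path).read_text(encoding="utf-8", errors="replace")[:max_chars]

def search_code(pattern: str, max_lines: int = 50) -> str:
    """Grep the repository for a pattern and return the first matches."""
    result = subprocess.run(
        ["grep", "-rn", pattern, "."],
        capture_output=True, text=True, check=False,
    )
    return "\n".join(result.stdout.splitlines()[:max_lines])

# The harness exposes these as tool/function calls and merely routes the
# requests, which is part of why it can stay thin.
```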
What the harness could not do was compensate for prompt deficiencies. When prompts were unclear about error handling, agents produced inconsistent error handling regardless of harness sophistication. When prompts failed to specify interface boundaries, agents violated those boundaries regardless of available tools. The harness could provide information and capabilities; it could not provide understanding.
Model selection mattered less than prompt quality: a well-prompted GPT-5.2 outperformed a poorly prompted Claude Opus 4.5, even though Opus scores higher on some benchmarks. The implication is that organizations should invest primarily in prompt engineering talent and processes rather than in chasing the latest model releases or building elaborate orchestration systems. Prompt engineering offers a higher return on investment than model optimization or harness development.
This hierarchy—prompts over harness over models—inverts conventional priorities in the AI engineering community. Startups pitch sophisticated agent frameworks. Researchers publish papers on novel architectures. Model providers compete on benchmark scores. Marketing emphasizes capability improvements measured in fractions of a percentage point on standardized evaluations. But the practical lesson from Cursor’s experiment is that these factors matter less than getting the prompts right. A junior engineer with a well-crafted prompt template can outperform a sophisticated system with vague instructions.
The economics reinforce this conclusion. Improved prompts are essentially free—they require engineering time but no incremental compute or licensing costs. Better models typically cost more per token. More sophisticated harnesses require development and maintenance. The highest-ROI intervention is also the cheapest.
The conclusion is almost anticlimactic: building effective agents is primarily a writing problem. The prompts are documentation that the model must follow. Like all documentation, clarity wins over cleverness. Specificity beats generality. Constraints enable freedom by reducing ambiguity. The best agent engineers may be the ones who write the clearest specifications, not the ones who build the most complex systems. This is good news for organizations without massive AI infrastructure budgets: the primary competitive advantage in agent development is clear thinking expressed in precise language.
What the browser teaches about agent futures
Cursor’s browser experiment offers lessons that extend beyond multi-agent systems into the broader question of how AI will transform software engineering. The browser now works. It’s not Chrome-quality—it struggles with complex JavaScript, advanced CSS layouts, and certain edge cases in HTML parsing. But it renders websites, manages tabs, and provides a usable browsing experience. Six days of agent work produced what would have taken a human team months or years.
The success suggests that AI agents can handle substantial software engineering work, including the creation of complex systems from scratch. The failure mode—flat hierarchies producing coordination overhead and architectural incoherence—suggests that throwing more agents at a problem without organizational structure produces diminishing or negative returns. The sweet spot involves hierarchical coordination with clear roles, specific prompts, and models optimized for instruction-following over raw intelligence.
The primacy of prompts implies that prompt engineering will become a critical competency for organizations deploying agents. This is not about knowing magic words that unlock model capabilities; it’s about writing clear, precise specifications that leave minimal room for misinterpretation. The skills involved—technical writing, requirements specification, test case design—are traditional software engineering competencies applied to a new medium.
For practitioners building agent systems, the Cursor experiment suggests starting simple. Use a basic harness. Focus energy on prompt quality. Test with realistic workloads that expose coordination failures. Add framework sophistication only when you’ve demonstrated that simpler approaches fail. Hierarchy is not a limitation to be overcome; it’s a feature that enables scale.
The browser sits on Cursor’s internal servers, a proof of concept that probably won’t ship as a product. But its existence proves something important: the constraints on what AI agents can build are loosening rapidly. Six days. Hundreds of agents. A working browser. The specifications we write for these systems—the prompts—determine what they produce. The implications for software engineering are profound and still unfolding.