Goals or outcomes: the coding-agent fork in the road
The agent loop has split into two religions
In the space of a single week, the two companies that share the frontier of agentic coding shipped opposing answers to the same problem. On April 30, OpenAI added a /goal slash command to Codex CLI 0.128.0 and turned the Ralph loop — that homespun “keep retrying until it works” pattern — into a first-class CLI primitive, complete with persistence, budget controls, and resume semantics, as Simon Willison flagged in his close reading of the changelog. On May 6, at Anthropic’s Code w/ Claude 2026 conference, the company unveiled three additions to its Managed Agents stack — Multi-agent orchestration, Outcomes, and Dreaming — and parked the centerpiece, Outcomes, on a different premise: instead of running a loop until the model is satisfied with its own work, run a loop until a separate grader, with its own context window and the rubric you wrote, signs off. Same problem. Opposite stopping rule.
The difference looks small on the surface. Both features take a long-running coding agent and make it durable across iterations. Both let a developer kick off a task, walk away, and come back to a finished artifact. But the philosophies underneath are genuinely incompatible. Codex’s /goal treats persistence as a runtime concern: the loop is a way to recover from interruptions, exhaust the budget, and survive an overnight session in a tmux tab. Anthropic’s Outcomes treats persistence as a quality contract: the agent runs until an external evaluator confirms it cleared a bar that a human authored. One bets that compute-times-tenacity solves the long-tail problems; the other bets that quality is a verification problem disguised as a generation problem.
That distinction is now load-bearing. Anthropic’s annualized revenue crossed $30 billion in April 2026, with the company holding 54% of the enterprise coding market against OpenAI’s 21%, per Menlo Ventures’ panel of 150 technical leaders. Claude Code alone is reported to run at a $2.5 billion annualized run rate as of February 2026, and business subscriptions to it have quadrupled since the start of the year. OpenAI is leaning back into the developer wedge after losing it once already, with GPT-5.5 in Codex available across Plus, Pro, Business, and Enterprise tiers and a $200/month Pro tier targeted at long-running agent sessions. Neither company needs an outright win; each needs to keep convincing CIOs that its model of “agent autonomy” is the one to standardize on. Goals or outcomes is now the question on the procurement form.
Follow the rubric, find the moat
Start with the mechanics, because the marketing language flattens a genuine difference in the code. Codex /goal works by injecting two prompt templates — goals/continuation.md and goals/budget_limit.md — at the end of every turn inside a single growing session, as the leaked CLI internals show. The model judges whether to continue based on its own evaluation of progress against a budget; persistence flows from the runtime giving the same context window more turns. The Codex April 2026 changelog calls this “stateful work instead of a single disposable turn,” and it lists the canonical use cases as multi-step migrations, maintenance loops, QA passes, and content pipelines. It is, in other words, a Ralph loop with a UI.
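To make the stopping rule concrete, here is a minimal sketch of a self-judged continuation loop in the /goal mold. It is not OpenAI’s implementation: run_turn and the budget fields are stand-ins, and only the template paths come from the reported CLI internals. The part that matters is that the worker model judges its own progress.

```python
# Minimal sketch of a self-judged continuation loop in the /goal mold.
# Not OpenAI's code: run_turn() is a stand-in for the real model call, and the
# budget fields are assumptions. The template paths are the ones reported in
# the CLI internals; the stopping rule is the part that matters.
from pathlib import Path

def run_turn(messages: list[dict]) -> dict:
    """Hypothetical model call; returns the assistant turn as a dict."""
    raise NotImplementedError

def goal_loop(goal: str, max_turns: int = 40, token_budget: int = 500_000) -> list[dict]:
    continuation = Path("goals/continuation.md").read_text()   # "keep going" nudge
    budget_limit = Path("goals/budget_limit.md").read_text()   # "budget exhausted" notice
    messages = [{"role": "user", "content": goal}]
    spent = 0
    for _ in range(max_turns):
        over_budget = spent >= token_budget
        # The runtime appends one of the two templates at the end of every turn;
        # the worker model itself judges whether it is done.
        messages.append({"role": "user",
                         "content": budget_limit if over_budget else continuation})
        reply = run_turn(messages)
        messages.append(reply)
        spent += reply.get("tokens_used", 0)
        if over_budget or reply.get("self_assessment") == "done":
            break   # stopping rule: the worker's own judgment (or the budget)
    return messages
```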
Anthropic’s Outcomes is structurally a different object. Per the Claude Managed Agents blog post, a developer hands the platform a rubric describing what success looks like, and Anthropic spins up a second agent — a grader — that lives in its own context window and never sees the worker agent’s reasoning. The grader scores each output against the rubric; if the work fails, the grader returns a structured verdict (needs_revision, with the specific gaps named), and the worker agent takes another pass with that feedback in hand. The terminal status is one of three: satisfied, needs_revision, or max_iterations_reached, per the practical comparison developersdigest published. Webhooks fire on completion. There is an audit trail. The “definition of done” is the API contract.
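The Outcomes stopping rule has a different shape, and pseudocode makes the contrast visible. The sketch below is not Anthropic’s API: the function names and payload fields are assumptions, and only the three terminal statuses and the separate grader context come from the public descriptions.

```python
# Minimal sketch of the Outcomes-style stopping rule. run_worker() and
# run_grader() are hypothetical stand-ins; the terminal statuses mirror the
# three described publicly: satisfied, needs_revision, max_iterations_reached.
from dataclasses import dataclass, field

@dataclass
class Verdict:
    status: str                            # "satisfied" or "needs_revision"
    gaps: list[str] = field(default_factory=list)

def run_worker(task: str, feedback: list[str]) -> str:
    """Hypothetical worker call: produces or revises the artifact."""
    raise NotImplementedError

def run_grader(artifact: str, rubric: str) -> Verdict:
    """Hypothetical grader call: fresh context, sees only the artifact and rubric."""
    raise NotImplementedError

def outcome_loop(task: str, rubric: str, max_iterations: int = 5) -> dict:
    feedback: list[str] = []
    artifact = ""
    for i in range(max_iterations):
        artifact = run_worker(task, feedback)       # worker context persists across passes
        verdict = run_grader(artifact, rubric)      # grader context does not
        if verdict.status == "satisfied":
            return {"status": "satisfied", "artifact": artifact, "iterations": i + 1}
        feedback = verdict.gaps                     # the named gaps go back to the worker
    return {"status": "max_iterations_reached", "artifact": artifact,
            "iterations": max_iterations}
```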
The internal benchmarks Anthropic published support the design choice with surprising specificity. The company reports task success up by as much as ten percentage points over its standard prompting-loop baseline, with the largest deltas concentrated on harder problems where the worker agent would otherwise declare victory prematurely. On structured-document generation specifically — a workload that has bedeviled coding agents because docx and pptx outputs look superficially correct while failing on layout, tone, or compliance — Outcomes posted +8.4% on docx and +10.1% on pptx over the same baseline. The takeaway is not that the worker model got smarter. The takeaway is that giving the loop an external scorekeeper, with its own context, eliminated the failure mode where the model talks itself into accepting bad output. That is a verification gain, not a generation gain — and it is the kind of gain that a benchmark-first culture tends to undervalue.
The companion features make the bet legible. Multi-agent orchestration lets a lead agent decompose a problem and dispatch specialist subagents to a shared filesystem, with every step traceable in the Claude Console. Netflix’s platform team is the named launch customer; their analysis agent processes build logs from hundreds of CI runs in parallel and surfaces only the patterns that recur across applications. Dreaming, the more speculative of the three and parked in research preview, schedules an offline session in which the agent reviews its prior runs and curates new memory files — Harvey reports completion rates rising six-fold in internal tests, although the methodology behind that number is thin in the public materials. Each piece points the same direction: agent quality is a property of the system around the model, not just the model itself. That framing is consistent with what I argued in the Opus 4.7 coding-crown post — Anthropic has decided that the next ten points of usable performance live in the harness, not the weights.
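The orchestration piece is easier to hold in your head as code than as prose. The sketch below is not Anthropic’s SDK; run_subagent and the workspace layout are assumptions. It shows only the pattern: decompose, dispatch specialists in parallel over a shared filesystem, and leave a per-step trace behind.

```python
# Rough shape of lead-agent orchestration over a shared workspace. None of
# these names are Anthropic's; the pattern is the point: decompose the task,
# dispatch specialists in parallel, and leave a per-step trace on disk.
import json
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

WORKSPACE = Path("workspace")   # shared filesystem the subagents read and write

def run_subagent(role: str, subtask: str) -> dict:
    """Hypothetical specialist call; returns a structured, JSON-serializable result."""
    raise NotImplementedError

def orchestrate(plan: list[tuple[str, str]]) -> list[dict]:
    """plan: (role, subtask) pairs produced by the lead agent's decomposition step."""
    WORKSPACE.mkdir(exist_ok=True)
    results = []
    with ThreadPoolExecutor(max_workers=8) as pool:
        for (role, _), result in zip(plan, pool.map(lambda p: run_subagent(*p), plan)):
            results.append(result)
            # every step leaves an auditable trace, analogous to the Console view
            (WORKSPACE / f"trace_{role}.json").write_text(json.dumps(result, indent=2))
    return results
```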
OpenAI’s design choice is no less coherent. GPT-5.5 was pitched in April as a more token-efficient generalist, and the vendor benchmarks back that up — GPT-5.5 reportedly produces 72% fewer output tokens on equivalent tasks while leading planning-and-execution evals like Terminal-Bench 2.0 (82.7% vs Opus 4.7’s 69.4%) and Toolathlon. If the model is cheaper to run, longer loops with no external evaluator are economically viable. /goal is the natural product of that math: when you have the most token-efficient frontier coder, “let it keep going” is a defensible default. Anthropic, whose Opus 4.7 beats GPT-5.5 on SWE-Bench Pro 64.3% to 58.6% but burns more tokens doing it, has every incentive to charge for verification scaffolding around a more expensive worker.
Here is the proprietary takeaway, stitched from the public numbers: if Anthropic’s claimed +10pt task-success delta from Outcomes generalizes to enterprise workloads, it more than compensates for Opus 4.7’s higher token cost relative to GPT-5.5 — because the alternative is rerunning a Codex /goal session that produced subtly broken output whose flaws only surface during human review. With Claude Code already commanding more than half of enterprise coding spend, that math reframes the buyer’s calculus from “which model is cheaper per token” to “which platform writes off fewer tickets in week six.” The Codex Pro tier at $200/month is priced for tenacity. Outcomes is priced for verification. They will eventually meet at a coupon-clipped enterprise contract that lets buyers have both — but only one of them sets the contractual definition of done.
There is a second layer to the math worth pricing in. The Outcomes grader runs in a separate context window — that is the entire point of the feature — and a separate context window means a separate input bill. For a moderately complex coding task that produces, say, 12K worker-token output across three iterations, a grader that re-reads the full output plus the rubric on each turn can easily double the per-task token budget. Anthropic has not yet published a discrete grader-tier price, but the cost shape is now visible: workers pay generation rates, graders pay something close to evaluation rates, and the bundle is competitive only if the verification gain is real. That is a falsifiable claim, and the next two quarters of customer telemetry will either confirm or shred it.
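That cost shape is easy to make falsifiable at your own volumes with a back-of-the-envelope model. Only the 12K-token output and the three iterations come from the example above; the rubric size and the even per-pass split are assumptions, and the figures are token counts rather than dollars, since grader tokens are mostly input tokens and input is typically priced below output.

```python
# Back-of-the-envelope token model for grader overhead. Counts are tokens,
# not dollars; grader tokens are mostly input tokens, usually priced lower
# than output tokens. Only the 12K total output and three iterations come from
# the example in the text; the rubric size and even per-pass split are assumptions.

def grader_overhead(total_worker_output: int, iterations: int, rubric_tokens: int) -> dict:
    per_pass = total_worker_output // iterations      # assumed even split per pass
    # the grader re-reads the current artifact plus the rubric on every pass
    grader_input = sum(per_pass * (i + 1) + rubric_tokens for i in range(iterations))
    return {
        "worker_output": total_worker_output,
        "grader_input": grader_input,
        "budget_multiplier": round((total_worker_output + grader_input) / total_worker_output, 2),
    }

print(grader_overhead(12_000, iterations=3, rubric_tokens=1_500))
# {'worker_output': 12000, 'grader_input': 28500, 'budget_multiplier': 3.38}
# i.e. under these assumptions the per-task token budget more than doubles.
```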
Where the outcomes bet could blow up
The skeptical case starts with the obvious: Outcomes ships in public beta against a Codex feature that is already shipping in CLI 0.128.0 to anyone with a ChatGPT subscription. By the time Anthropic’s grader is generally available, OpenAI will have iterated /goal through several minor versions, possibly with its own evaluator agent grafted on. Both companies have demonstrated the ability to copy the other’s primitives within a quarter — Anthropic shipped its official ralph-wiggum plugin for Claude Code in late 2025, and OpenAI added browser-use to Codex on April 23 only weeks after Anthropic’s similar feature. Conceptual moats in the agent runtime layer are, on the available evidence, a few weeks deep at most.
The harder critique is that Outcomes pushes the failure mode from the worker to the rubric author. A grader with a sloppy or under-specified rubric will rubber-stamp the same broken output the worker would have produced, only with more API calls and a delayed delivery. The cookbook materials Anthropic published, including the outcome-grader walkthrough, implicitly acknowledge this; they spend significant time on rubric construction patterns and on edge cases where the grader and worker fall into a stable but wrong equilibrium. Rubric design is, effectively, prompt engineering for evaluators — a skill that few engineering teams have at scale, and one that is harder to outsource than most. Companies with mature QA cultures will love this; companies that have always shipped on intuition will find it onerous.
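One way to operationalize “prompt engineering for evaluators” is to treat each rubric as a versioned test fixture with its own regression cases: known-bad artifacts the grader must reject and known-good ones it must pass. The structure below is a suggested pattern, not a format taken from Anthropic’s cookbook.

```python
# A rubric treated as a versioned artifact with its own regression cases:
# known-bad outputs the grader must reject and known-good ones it must pass.
# The structure is a suggested pattern, not a format from Anthropic's cookbook.
RUBRIC = {
    "version": "2026-05-12.1",
    "criteria": [
        "Every public function touched by the diff has a docstring and type hints.",
        "No new dependency is added outside the approved list in deps.txt.",
        "The migration is reversible: a down() step exists and is exercised by a test.",
    ],
    "regression_cases": [
        # artifacts that previously slipped through (or should obviously pass)
        {"artifact": "examples/missing_down_migration.diff", "expected": "needs_revision"},
        {"artifact": "examples/known_good_migration.diff", "expected": "satisfied"},
    ],
}

def check_rubric_regressions(rubric: dict, grade) -> list[str]:
    """Run the rubric's own regression suite; grade(artifact, rubric) is your grader call."""
    failures = []
    for case in rubric["regression_cases"]:
        verdict = grade(case["artifact"], rubric)
        if verdict != case["expected"]:
            failures.append(f"{case['artifact']}: expected {case['expected']}, got {verdict}")
    return failures
```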
A second weakness sits in the benchmarks themselves. GPT-5.5 leads Opus 4.7 on Terminal-Bench 2.0 and the planning-and-execution evals where /goal-style loops shine, as the vellum.ai analysis lays out, and the token-efficiency gap is real. For high-volume agent workloads — the kind of background coding tasks that run unattended overnight on a fleet of repos — runtime cost matters more to the buyer than verification quality. A team that triggers two thousand small agent runs a week may find that Codex /goal plus light human review is materially cheaper than Outcomes plus a maintained rubric library. The Anthropic pitch presumes that verification overhead pays back in fewer broken artifacts; that math may not work for shops where artifacts are cheap to fix and expensive to gate. This is the same tension I flagged in the Anthropic-SpaceX Colossus post — verification is compute-intensive, and Anthropic is buying compute on terms that suggest it expects the verification volume to keep climbing.
A third concern is the lock-in built into the grader itself, which sounds esoteric until you read the small print on Outcomes. The grader’s rubric is a contractual artifact; an enterprise that defines “passes acceptance” via Anthropic-hosted graders has now embedded vendor-specific language into its acceptance test infrastructure. Switching costs go up, not down, the longer that arrangement runs. OpenAI knows this — every announcement coming out of the Codex flexible-pricing rollout emphasizes runtime portability and avoiding vendor capture. There is a real argument that /goal is more buyer-friendly precisely because it is less opinionated about quality. Smart procurement leads will weigh that.
The competitive context cuts both ways. Anthropic’s 54% coding market share is not destiny; OpenAI’s GPT-5.5 launch and the Codex pricing realignment to API-token billing on April 23 across all Enterprise plans are the most aggressive moves the company has made in developer tooling since the original Codex preview. Pricing alignment to API tokens means heavy users now scale predictably with their consumption rather than running into per-message ceilings — historically a friction point for teams using Codex inside automated pipelines. That change alone may close some of the share gap, regardless of what either side ships next.
There is also the dreaming question — and dreaming is the moat in disguise. Anthropic positions Dreaming as memory curation, but the deeper read is that Dreaming is how the company plans to compound the value of every Outcomes run over time. Every graded session is a labeled training signal: the worker output, the rubric, the grader’s verdict, and the eventual revision. That is exactly the kind of feedback data that a closed-loop RL pipeline thrives on. If Anthropic uses anonymized, aggregated traces from Outcomes-graded sessions to refine the next Opus model — and the Anthropic Managed Agents documentation leaves the door open to this — then Outcomes is not a feature; it is a data flywheel masquerading as a feature. OpenAI’s /goal produces no comparable signal because there is no labeling event. The single biggest risk to OpenAI’s coding business is not that Outcomes wins on benchmarks today; it is that Outcomes-generated training data lets Anthropic open a wider gap on the next model.
Counter-argument to the counter-argument: Anthropic could face exactly the regulatory exposure I covered in the Pennsylvania Character.AI case if its grader produces decisions that downstream customers treat as authoritative — a fact Anthropic’s enterprise legal team is presumably tracking. There is no body of US law on automated quality graders for software artifacts yet, but if Outcomes traces become evidence in software-defect lawsuits, the trace becomes a liability vector at the same time it is a moat.
What to install before the next sprint
The reasonable read of the next twelve months is that both stopping rules survive, but they end up serving different operational seats. /goal becomes the default for individual developer workflows where speed and recovery dominate — the bash-script equivalent for agent-era coding, durable across reboots and context limits, judged by whether the test suite turns green. Outcomes becomes the default for cross-functional and regulated workflows where multiple stakeholders need a paper trail — compliance reviews, content pipelines with brand standards, multi-team handoffs where the receiving team’s definition of “done” must be encoded somewhere a manager can audit. The hybrid pattern, exactly as developersdigest’s practical comparison recommends, is to use /goal for execution durability inside a developer’s IDE and Outcomes as the gate before any artifact ships. That stack is more expensive than either alternative alone but better than the human-in-the-loop status quo for any team running more than a handful of agents.
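Sketched as a pipeline, the hybrid stack is two stages: a durable, self-judged execution loop for the work itself, then a graded gate before anything ships. Both stubs below are stand-ins for whichever runtimes you actually wire in; nothing here is either vendor’s API.

```python
# Hybrid stopping rules: a /goal-style durable execution loop for the work,
# an Outcomes-style graded gate before the artifact ships. Both stubs are
# stand-ins for whichever runtime you actually wire in.

def execute_with_goal_loop(task: str) -> str:
    """Durable, self-judged execution (the /goal side). Hypothetical stand-in."""
    raise NotImplementedError

def grade_against_rubric(artifact: str, rubric: dict) -> tuple[str, list[str]]:
    """External gate (the Outcomes side): ('satisfied', []) or ('needs_revision', gaps)."""
    raise NotImplementedError

def ship_pipeline(task: str, rubric: dict, max_gate_retries: int = 2) -> dict:
    artifact = execute_with_goal_loop(task)                    # stage 1: run until tests are green
    for attempt in range(1, max_gate_retries + 2):
        status, gaps = grade_against_rubric(artifact, rubric)  # stage 2: gate before shipping
        if status == "satisfied":
            return {"shipped": True, "artifact": artifact, "gate_attempts": attempt}
        # feed the named gaps back into another durable execution pass
        artifact = execute_with_goal_loop(task + "\n\nGate feedback:\n- " + "\n- ".join(gaps))
    return {"shipped": False, "artifact": artifact, "gate_attempts": max_gate_retries + 1}
```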
The architectural lesson runs deeper than the feature comparison. Generation has stopped being the bottleneck in agentic systems; verification has. The companies that figure out scalable verification — through external graders, multi-agent cross-checks, deterministic test scaffolds, or some combination — will win the next round of the enterprise contracts. The companies that bet on bigger context windows and cheaper tokens alone will lose those contracts to the companies that ship trustworthy artifacts. The structural shift in the agentic coding trends report Anthropic released at Code w/ Claude 2026 is exactly this: the percentage of enterprise builds that involve a verification step before merge has climbed from 22% in the first quarter of 2025 to a reported 61% in the first quarter of 2026. That is the curve to track.
Where this leads procedurally is also forecastable. By Q3 2026, expect OpenAI to ship its own external-grader primitive — likely as an extension of Codex’s reviewer agent feature — and to position it as evaluator-agnostic compared with Anthropic’s tighter coupling. Expect Anthropic to extend Outcomes from documents and code into UI flows, with screenshot-based graders that compare actual rendered outputs against design specs. Expect both companies to pitch hybrid runtime-and-quality stacks to large enterprise accounts where the buyer wants to standardize on a single agent platform. And expect Cursor, Cognition, and the rest of the application-layer tooling vendors to integrate at least one of the two stopping rules within the quarter — whoever ships the first universal grader API wins a level of platform leverage neither model lab currently has.
The operator checklist if you are running coding agents at scale right now:
- Pick a stopping rule per workflow, not per team. Code review and content generation have different acceptance criteria. Mixing /goal for one and Outcomes for another is a feature, not a bug.
- Write rubrics like you write tests. Outcomes only works as well as the rubric. Treat them as versioned artifacts with their own review process and their own regression cases. A bad rubric is silently more dangerous than no rubric at all.
- Budget for verification compute, not just generation compute. External graders cost real money. If your agent budget assumed worker-only token math, double it for any workflow you put behind Outcomes.
- Keep /goal sessions bounded and observable. Long Codex /goal runs without tracing produce hallucinated progress that surfaces as silent failures days later. Pair with Codex’s reviewer agents and explicit budget caps.
- Store your traces. Whichever vendor you choose, the grader-and-worker traces are training-quality data. Negotiate ownership and exportability into your contract before the renewal cycle pins you.
- Pilot the hybrid stack on one squad. A single team running Codex for execution and Outcomes-graded handoff to QA is the cheapest way to learn whether the verification overhead pays back in your codebase. Six weeks is enough.
- Watch the dreaming features quietly. Anthropic’s Dreaming and any forthcoming OpenAI memory-refinement equivalent are how vendor lock-in deepens after the contract is signed. Decide your data-residency posture before you opt in, not after.
The fork in the road that opened on April 30 and closed on May 6 was real, and it changed the shape of every coding-agent procurement conversation that lands on a CIO’s desk between now and year-end. One vendor is asking you to trust the loop. The other is asking you to write the rubric. The honest answer for most enterprises is that they need both — and the platform that makes the seam between them invisible is the one that compounds across the next two model generations.