Stephen Van Tran

Four days ago, Anthropic dropped a model without a press event, a countdown timer, or a celebrity demo reel. Claude Opus 4.7 shipped on April 16 as a quiet successor to Opus 4.6, priced identically to its predecessor, available immediately on the same infrastructure, and carrying benchmark scores that reopen a leaderboard that had looked settled since GPT-5.4 dropped in March. What made the release remarkable was not the marketing but the numbers: a 10.9 percentage-point jump on SWE-bench Pro, the most widely cited proxy for real-world software engineering capability, and scores that put Anthropic back at the front of the pack on five of the industry’s major benchmark tables. For enterprise buyers who have spent the past six weeks choosing between GPT-5.4 and Gemini 3.1 Pro, those numbers force the evaluation back open.

The context matters. Anthropic enters this release on arguably the strongest commercial footing in its history. The company overtook OpenAI in annualized revenue for the first time in April, crossing $30 billion ARR while spending roughly a quarter as much on training its models. Eight of the Fortune 10 are customers. More than a thousand business accounts spend over a million dollars annually on Claude. That commercial foundation creates the runway to ship improvements at this cadence, and Opus 4.7 is a direct product of that momentum.

The throne was never permanently occupied

The history of AI model rankings since GPT-4 arrived in 2023 is best described as a relay race with no finish line. Google led briefly on multimodal tasks, OpenAI led on reasoning, Anthropic led on coding, xAI briefly topped real-time web-search benchmarks, and the lead changed hands on average every eight weeks through 2025. The pace of that churn accelerated in early 2026: GPT-5.4 launched in March and immediately claimed the top position on autonomous desktop task benchmarks, reaching 75% on OSWorld (where the human baseline sits at 72.4%) and becoming the first frontier model to demonstrate reliably superhuman performance on computer-use tasks. Gemini 3.1 Pro then carved out a durable cost-efficiency lead with pricing at $2 per million input tokens, roughly 60% cheaper than Claude on input, while maintaining competitive scores on vision and document understanding workloads. By the first week of April, most practitioner discussions had converged on a working assumption: Claude was the coding specialist of the prior cycle, GPT-5.4 was the emerging computer-use leader, and Gemini was the economics winner. Opus 4.7 disrupts that narrative on precisely the axis where Anthropic’s lead was most expected to erode.

The stakes of the leaderboard contest are no longer theoretical. Agentic coding pipelines — systems where a model autonomously decomposes, implements, tests, and iterates on software tasks without human checkpoints — are moving from experimental to production. Major engineering organizations are committing multi-year infrastructure budgets to the model that scores best on real-world software benchmarks, not conversational quality or creative writing. Those decisions involve eight-figure annual commitments, procurement timelines of six to twelve months, and switching costs that rise sharply once the model is embedded in CI/CD pipelines, internal tools, and knowledge-work workflows. When Anthropic ships a model that jumps 10.9 points on the benchmark the market has anchored to, it is not just winning a chart — it is repositioning for a contract cycle. The companies evaluating inference infrastructure this quarter will use Opus 4.7 scores as a key input, and some percentage of deals that would have defaulted to GPT-5.4 will now be reopened.

There is a structural reason the coding gap matters more than other benchmarks. Software engineering tasks on SWE-bench require the model to read a repository, identify the root cause of a reported bug or feature request, write a patch that passes the test suite, and do so without being told which files to touch or which functions to modify. It is a closer proxy to real agentic work than MMLU or GPQA because it requires multi-step reasoning under ambiguity across an existing codebase, not just retrieval or synthesis of known facts. A jump of nearly 11 points in a single generation on that benchmark is not a rounding error. At 57.7%, GPT-5.4 was already passing the majority of professional-grade software tasks without human assistance. At 64.3%, Opus 4.7 has extended that frontier further, and the competitive pressure it places on OpenAI and Google to respond will shape the next three months of model releases across the industry.

Read the tape, then read the fine print

The headline benchmark is SWE-bench Pro at 64.3%, but the full scorecard is worth reading in detail because the distribution of wins and losses reveals where Anthropic chose to invest and where it implicitly conceded. The Next Web’s release-day breakdown provides the clearest table, and the pattern across seven major leaderboards points to a model tuned specifically for software engineers and knowledge workers, not for search-heavy consumer use cases.

On SWE-bench Verified, a curated subset of the full benchmark, Opus 4.7 scores 87.6%, up from 80.8% on its predecessor and comfortably ahead of Gemini 3.1 Pro’s 80.6%. The 6.8-point improvement from Opus 4.6 to Opus 4.7 on this metric alone would have been newsworthy in any prior cycle; paired with the Pro score, it demonstrates that the improvement is consistent across both the broader and the harder subsets of software engineering tasks. Terminal-Bench 2.0, which measures autonomous command-line task completion in real shell environments, lands at 69.4%, a number that matters specifically for infrastructure automation and DevOps workflows where models are increasingly being deployed without human oversight.

The agentic knowledge-work benchmark, GDPVal-AA, offers perhaps the most striking competitive separation. According to Vellum AI’s benchmark analysis, Opus 4.7 achieves an Elo score of 1753 on this table, which measures how well a model performs on tasks with real economic value — the kind of work that would otherwise require a skilled professional. GPT-5.4 scores 1674 on the same benchmark, and Gemini 3.1 Pro sits at 1314. The 439-point gap between Opus 4.7 and Gemini on knowledge work is larger than the entire gap between Gemini and a typical sub-frontier model, suggesting these are not interchangeable tools for enterprise document workflows. For legal analysis, financial modeling, multi-step research synthesis, and professional writing tasks, the Elo gap represents a meaningful difference in output quality that practitioners will notice on production workloads.

The vision upgrade is the less-discussed improvement that could prove more commercially significant over time. Artificial Analysis reports that Opus 4.7 processes images at 3.75 megapixels — three times the resolution of Opus 4.6 — and scores 79.5% on visual navigation benchmarks that require interpreting screen content, diagrams, and documents, up from 57.7% on its predecessor. That 22-point jump in visual reasoning is directly relevant to computer-use agents that must interpret UI elements, parse complex charts, and extract information from scanned documents. Amazon’s Bedrock launch post notes that the higher resolution specifically unlocks better performance on visual document analysis tasks — financial statements, technical schematics, and annotated PDFs — that have historically required dedicated OCR pipelines rather than a single multimodal call.

One original quantitative takeaway emerges when you stitch the benchmark improvements together: Opus 4.7’s average delta over Opus 4.6 across the five benchmarks where it improved is approximately 11.8 percentage points. That is the largest single-generation improvement Anthropic has shipped since the original Claude 3 Opus in early 2024. The trend is not narrowing; it is accelerating, and it suggests that Anthropic’s scaling investments, including the 3.5-gigawatt TPU deal with Google and Broadcom announced in early April, are starting to show up in measured capability rather than just training capacity. The model is the product of that compute runway, and the benchmark arc says the investments are compounding.

Where the model quietly concedes

The strengths above are real, but a benchmark table that is only read in one direction is a sales deck, not analysis. Opus 4.7 ships with a documented regression, a price premium that preserves the already substantial cost disadvantage relative to Gemini, and a tokenizer change that can inflate costs in ways that are not obvious from the headline price. Buyers who understand these limits will deploy better; buyers who ignore them will get surprised on their first invoice.

The regression is on BrowseComp, a benchmark testing a model’s ability to conduct autonomous multi-source web research and synthesize findings from multiple live URLs. CNBC’s release coverage notes the drop plainly: Opus 4.7 scores 79.3% on BrowseComp, down from 83.7% on Opus 4.6. GPT-5.4 Pro scores 89.3% on the same table, and Gemini 3.1 Pro scores 85.9%. This is not a marginal degradation. In absolute terms, Opus 4.7 falls 10 points behind GPT-5.4 on web-research tasks, the mirror image of the SWE-bench lead, and 6.6 points behind Gemini, a model that costs less than half as much per token. For any workflow where the primary use case involves research agents that browse live sources, aggregate information from news articles, or compile competitive intelligence from the open web, GPT-5.4 is the empirically stronger choice and Gemini 3.1 Pro is the more cost-efficient one. Opus 4.7 is not the right tool for every job, and honest evaluation requires acknowledging which category of work reveals that limit.

The pricing math deserves a closer look. Opus 4.7 is priced at $5 per million input tokens and $25 per million output tokens, identical to Opus 4.6. GPT-5.4 sits at $2.50 per million input and $15 per million output. Gemini 3.1 Pro costs $2 input and $12 output. On a pure cost-per-token basis, Opus 4.7 costs 2.5x the input price and just over 2x the output price of Gemini. For high-volume enterprise workloads processing billions of tokens per day, that delta is not an abstraction; it can represent hundreds of thousands of dollars per month. The benchmarks justify a premium for tasks where Opus 4.7 genuinely leads, but organizations deploying at scale need to calculate the premium against real task performance, not average benchmarks. A 64.3% score on SWE-bench Pro is a competitive edge that matters for software engineering agents; it does not justify a 2x cost premium for email summarization or FAQ generation where sub-frontier models perform equivalently.
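To make that delta concrete, here is a back-of-the-envelope monthly comparison. The daily token volumes are hypothetical; the per-million-token prices are the published figures above.

```python
# Back-of-the-envelope monthly cost comparison for a hypothetical workload:
# 1B input tokens and 200M output tokens per day, 30 days per month.
# Prices are USD per million tokens, as listed above.
PRICES = {
    "opus-4.7":       {"input": 5.00, "output": 25.00},
    "gpt-5.4":        {"input": 2.50, "output": 15.00},
    "gemini-3.1-pro": {"input": 2.00, "output": 12.00},
}

INPUT_M_PER_DAY = 1_000   # millions of input tokens per day (hypothetical)
OUTPUT_M_PER_DAY = 200    # millions of output tokens per day (hypothetical)
DAYS = 30

for model, p in PRICES.items():
    monthly = DAYS * (INPUT_M_PER_DAY * p["input"] + OUTPUT_M_PER_DAY * p["output"])
    print(f"{model:15s} ${monthly:>9,.0f}/month")

# opus-4.7        $  300,000/month
# gpt-5.4         $  165,000/month
# gemini-3.1-pro  $  132,000/month
```

At those hypothetical volumes, the Opus premium over Gemini is roughly $168,000 per month before any tokenizer effects, which is why per-workload routing rather than a single default model is the economically relevant decision.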

The tokenizer change compounds the pricing issue in a subtle way. Tom’s Guide’s hands-on review documents Anthropic’s disclosure that Opus 4.7 uses an updated tokenizer that can increase input token counts by 1.0 to 1.35x relative to Opus 4.6 on the same input text. The model also has a documented tendency to “think harder” at high effort levels, which increases output token consumption. In practice, a prompt whose input tokens cost $0.01 on Opus 4.6 could cost up to roughly $0.0135 on Opus 4.7 for the same text, before accounting for longer output chains. Organizations migrating from Opus 4.6 without adjusting their cost models will see budget overruns in the first billing cycle that are not explained by changes in usage volume alone.

A broader critique deserves mention: the AI industry’s benchmark infrastructure is increasingly self-referential in ways that undermine the claims being made. SWE-bench was designed in 2023 and has been iterated on since, and the community of companies publishing against it has had more than two years to tune models specifically on its distribution of problems. Build Fast with AI’s comparative review raises the point directly: practitioner comparisons on real internal codebases frequently show smaller performance gaps between frontier models than the published benchmarks suggest. The benchmarks remain meaningful directional signals (a 64.3% versus 57.7% gap is too large to be purely an artifact of overfitting), but the gap in production for any specific codebase or task type will depend on factors the benchmark cannot capture: proprietary API conventions, internal documentation quality, and the idiosyncratic requirements of the task domain.

Building on Opus 4.7: your deployment checklist

The right question for any team evaluating Opus 4.7 is not “is it the best model?” — the answer is context-dependent — but “for which of my workloads does the evidence favor switching?” The benchmark data and pricing structure point to a clear answer for most enterprise deployments.

Route agentic coding tasks to Opus 4.7. For any workflow where a model must read, modify, test, or generate production code autonomously across a multi-file repository, Opus 4.7’s 64.3% SWE-bench Pro score is the single most relevant data point. The 6.6-point lead over GPT-5.4 and the 10-point lead over Gemini 3.1 Pro reflect a model that handles ambiguous requirements, correctly interprets existing code patterns, and navigates large codebases with fewer hallucinations. Organizations running Claude Code, Cursor, or custom coding agents should upgrade to Opus 4.7 for their flagship agentic tier and benchmark cost per accepted diff against the prior model before committing at scale.

Keep web-research agents on GPT-5.4. The BrowseComp regression is unambiguous. Any pipeline where the model’s primary operation is browsing live sources — news monitoring, competitive intelligence, literature review, regulatory tracking — should route to GPT-5.4 until Anthropic closes the gap. The 10-point lead GPT-5.4 holds on BrowseComp, combined with its lower output pricing at $15 versus $25 per million tokens, makes the economics clear for research-heavy workloads.

Use Gemini 3.1 Pro for cost-sensitive, high-volume tasks. Classification, summarization, extraction from well-structured documents, and FAQ-style queries do not require frontier capability. Gemini 3.1 Pro at $2/$12 performs within a few percentage points of Opus 4.7 on knowledge benchmarks for these workloads while costing less than half as much per token. The 1 million token context window is identical. For volume workloads where cost per token drives profitability, Gemini is the rational choice.

Upgrade vision-dependent workflows. The 3.75-megapixel image processing upgrade makes Opus 4.7 the strongest generally available option for document analysis at scale. Legal teams processing scanned contracts, finance teams interpreting complex charts and tables, and operations teams working with technical diagrams should test Opus 4.7 directly against their current pipeline. The 22-point improvement on visual navigation benchmarks from Opus 4.6 is the kind of capability jump that often translates into measurable accuracy improvements on real-world document tasks where Opus 4.6 was borderline.
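Wired into an orchestration layer, the four routing rules above reduce to a small dispatch table. The sketch below is illustrative only: the workload taxonomy is one reasonable cut, and the model identifiers are placeholders rather than official API names.

```python
# Minimal routing sketch for the workload split described above.
# Model identifiers below are placeholders, not official API model names.
from enum import Enum, auto

class Workload(Enum):
    AGENTIC_CODING = auto()      # multi-file repo edits, autonomous patching
    WEB_RESEARCH = auto()        # live browsing, news monitoring, comp intel
    HIGH_VOLUME_SIMPLE = auto()  # classification, summarization, FAQ-style queries
    VISUAL_DOCUMENTS = auto()    # scanned contracts, charts, technical diagrams

ROUTING_TABLE = {
    Workload.AGENTIC_CODING:     "opus-4.7",        # SWE-bench Pro leader
    Workload.WEB_RESEARCH:       "gpt-5.4",         # BrowseComp leader, cheaper output
    Workload.HIGH_VOLUME_SIMPLE: "gemini-3.1-pro",  # lowest cost per token
    Workload.VISUAL_DOCUMENTS:   "opus-4.7",        # 3.75 MP vision upgrade
}

def pick_model(workload: Workload) -> str:
    """Return the model identifier for a workload category per the rules above."""
    return ROUTING_TABLE[workload]

assert pick_model(Workload.WEB_RESEARCH) == "gpt-5.4"
```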

Account for the tokenizer shift before migrating. Run a representative sample of your highest-volume prompts through Opus 4.7 in a staging environment and compare token counts against Opus 4.6 before committing the migration budget. Add 20% to your cost estimate as a conservative buffer and measure actual consumption over a 72-hour window before extrapolating. Organizations that skip this step will face billing surprises; those that run the calibration first will have a defensible unit economics story for any internal approval process.
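A minimal calibration sketch for that step, assuming the Anthropic Python SDK’s token-counting endpoint and placeholder identifiers for the two Opus versions:

```python
# Calibration sketch: compare input token counts for the same prompts on both
# Opus versions before migrating. Assumes the Anthropic Python SDK's
# count_tokens endpoint; the model identifiers below are placeholders.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

OLD_MODEL = "claude-opus-4-6"   # placeholder identifier
NEW_MODEL = "claude-opus-4-7"   # placeholder identifier

def count_input_tokens(model: str, prompt: str) -> int:
    """Count input tokens for a single-turn user prompt on the given model."""
    result = client.messages.count_tokens(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return result.input_tokens

def tokenizer_inflation(prompts: list[str]) -> float:
    """Average ratio of new-tokenizer to old-tokenizer input token counts."""
    ratios = [
        count_input_tokens(NEW_MODEL, p) / count_input_tokens(OLD_MODEL, p)
        for p in prompts
    ]
    return sum(ratios) / len(ratios)

# Feed in a representative sample of your highest-volume production prompts,
# then pad the measured inflation with the 20% buffer suggested above.
sample_prompts = ["...representative production prompts go here..."]
inflation = tokenizer_inflation(sample_prompts)
budget_multiplier = inflation * 1.20
print(f"Measured inflation: {inflation:.2f}x, budgeted multiplier: {budget_multiplier:.2f}x")
```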

Evaluate the full context window. The 1 million token context window paired with 128K maximum output is the most capable long-context combination Anthropic has shipped. Tasks that previously required chunking — full repository analysis, end-to-end contract review, long-horizon research synthesis — can now run in a single pass. Measure the latency and cost tradeoff for your specific use case, because single-pass large-context calls can be more economical than the multi-call pipelines they replace, even at the higher per-token price.
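A rough way to sanity-check that tradeoff for a given job, using the published Opus 4.7 prices and hypothetical document, chunk, and overlap sizes:

```python
import math

# Rough cost comparison: one long-context pass vs. a chunked pipeline over the
# same document. Sizes below are hypothetical; prices are the published
# Opus 4.7 figures ($5 / $25 per million input / output tokens).
DOC_TOKENS = 800_000        # hypothetical full-repository or contract size
CHUNK_TOKENS = 150_000      # hypothetical chunk size in the old pipeline
OVERLAP_TOKENS = 20_000     # hypothetical shared context re-sent with each chunk
OUTPUT_PER_CALL = 4_000     # hypothetical output tokens per call

INPUT_PRICE = 5 / 1_000_000
OUTPUT_PRICE = 25 / 1_000_000

# Single pass: the whole document is sent once, one response comes back.
single_pass = DOC_TOKENS * INPUT_PRICE + OUTPUT_PER_CALL * OUTPUT_PRICE

# Chunked pipeline: every token is sent once, each chunk re-sends the shared
# context, and a final synthesis call re-reads every chunk's intermediate output.
n_chunks = math.ceil(DOC_TOKENS / CHUNK_TOKENS)
chunked_input = DOC_TOKENS + n_chunks * OVERLAP_TOKENS + n_chunks * OUTPUT_PER_CALL
chunked_output = (n_chunks + 1) * OUTPUT_PER_CALL
chunked = chunked_input * INPUT_PRICE + chunked_output * OUTPUT_PRICE

print(f"single pass: ${single_pass:.2f}   chunked: ${chunked:.2f}")
# single pass: $4.10   chunked: $5.42
```

Under these assumptions the single pass comes out cheaper; the crossover depends mostly on how much shared context the chunked pipeline re-sends and how many intermediate outputs the synthesis step has to re-read.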

The launch of Opus 4.7 does not simplify the model selection landscape. It complicates it productively. The frontier tier now has three genuinely differentiated flagships — Anthropic leading on coding and knowledge work, OpenAI leading on web research and computer use, Google leading on cost efficiency — and enterprise buyers who deploy all three in their appropriate contexts will outperform buyers who rely on a single vendor across every workload. The throne remains unoccupied, precisely because the race to fill it is producing better models faster than any single company can hold the lead.

In other news

Snap cuts 1,000 jobs as AI writes 65% of its new code — Snapchat’s parent company announced layoffs of 16% of its workforce on April 15, with CEO Evan Spiegel calling the moment a “crucible” for the company. In an internal memo, Snap disclosed that AI now generates more than 65% of all new code shipped — the immediate trigger for eliminating roles across software engineering, data science, and product management. The cuts are projected to reduce Snap’s annualized cost base by more than $500 million in the second half of 2026 (TechCrunch).

Nebraska Supreme Court suspends attorney over 57 AI hallucinations — Omaha attorney Greg Lake received an indefinite suspension from the Nebraska Supreme Court on April 15 after his appellate brief in a divorce case contained 57 defective citations out of 63, including 20 fully fabricated case references that do not exist in any jurisdiction. Lake initially told the court he had submitted the wrong file while traveling, then admitted to using AI. The court’s unanimous opinion is the first bar discipline action in the U.S. to result in full practice suspension over AI-generated filing errors (WOWT).

Anthropic secures 3.5 gigawatts of TPU compute from Google and Broadcom — A week before shipping Opus 4.7, Anthropic disclosed an expanded partnership with Google and Broadcom that will deliver approximately 3.5 gigawatts of next-generation TPU capacity starting in 2027, on top of roughly 1 gigawatt already committed for 2026. Mizuho analysts estimated that Broadcom would record $21 billion in AI revenue from Anthropic in 2026 and $42 billion in 2027 under the terms of the agreement — a figure that, if accurate, would make Anthropic Broadcom’s single largest AI customer (TechCrunch).

Meta building photorealistic AI avatar of Zuckerberg for employees — The Financial Times reported on April 14 that Mark Zuckerberg is personally spending five to ten hours per week developing a photorealistic AI clone of himself, trained on photos, voice recordings, and his public statements on company strategy, intended to “engage with employees in his stead.” If the internal deployment succeeds, Meta plans to commercialize the underlying technology to creators and public figures, according to the report (PetaPixel).