Claude Opus 4.7 Review: New #1 on LMArena, What Changed?
Claude Opus 4.7 reviewed: 64.3% on SWE-bench Pro, 3x vision upgrade, tokenizer trap, and how it compares to GPT-5.4 and Gemini 3.1 Pro.

Bytewaves Score Card
Claude Opus 4.7 shipped on April 16, 2026, and hit every benchmark expectation that weeks of developer speculation had built up. On SWE-bench Pro (the hardest public agentic coding benchmark currently in use) it scored 64.3%, up from 53.4% on Opus 4.6 and ahead of GPT-5.4 at 57.7% and Gemini 3.1 Pro at 54.2%. On LMArena, it leads the coding sub-leaderboard at 1567 Elo. Partner reports from Cursor, Rakuten, CodeRabbit, Warp, and Notion confirmed the gains on real production workloads, not just synthetic benchmarks.
The story has a second half. The same release introduced a tokenizer change that can increase billable token counts by up to 35% for identical input text, three breaking API changes that require prompt retuning for anyone coming from Opus 4.6, and a wave of community reports about instruction-following regressions and safety over-caution that did not match the benchmark narrative. By the time Opus 4.8 shipped six weeks later, the 4.7 discussion had become a case study in the gap between benchmark performance and developer experience.
This review covers what actually changed, what the numbers mean and don't mean, and whether migrating from Opus 4.6 is worth the effort.
API migration required: Opus 4.7 has three breaking changes from Opus 4.6: extended thinking budgets removed (replaced by Adaptive Thinking), temperature/top_p/top_k parameters removed, and thinking content empty by default. Existing API integrations need to be audited before migrating. See the migration section below.
What benchmark position actually means in 2026
Before going through what changed, it is worth being precise about what "new #1 on LMArena" means in the current leaderboard environment.
LMArena uses blind pairwise voting across 327 models. Users compare two anonymous responses and pick the better one. The resulting Elo rating is widely considered the least-gameable independent ranking because it reflects real human preference across real prompts rather than vendor-controlled evaluations. Claude Opus 4.7's 1567 Elo on the coding sub-leaderboard is a genuinely meaningful number.
Two calibration points matter for interpreting it. First, LMArena rebranded from LMSYS Chatbot Arena in early 2026 and introduced methodology updates including refined Style Control filtering that shifted some Elo distributions by 20-40 points without reflecting real model quality changes. Raw Elo numbers across different methodology versions are not directly comparable. Second, the 30-Elo procurement rule: treat any gap under 30 points as within statistical noise. 60% of models within that band swap ranks within a single quarter.
On SWE-bench, two variants dominate the conversation. SWE-bench Verified (500 human-validated GitHub issues) gets the headline coverage: Opus 4.7 scored 87.6%, the first model to cross 87%. SWE-bench Pro (Scale AI's harder multi-file, multi-language variant with no public ground truth) is what serious developers watch because it resists benchmark-specific optimization. Opus 4.7 scored 64.3% on Pro. The roughly 23-point gap between Verified and Pro scores shows how much of any model's Verified performance reflects benchmark-specific optimization rather than general capability.
The 10.9-point improvement on SWE-bench Pro in a single release is the largest Anthropic has posted in a single Opus update. It holds after memorization screens were applied. That makes it the most credible part of the 4.7 story.

What actually changed in Opus 4.7
Adaptive Thinking replaces extended thinking budgets
The most significant architectural change is the replacement of extended thinking budgets with Adaptive Thinking. In Opus 4.6, developers could set a thinking.budget_tokens parameter to control chain-of-thought depth. Opus 4.7 removes this: the model decides reasoning depth based on task difficulty without explicit developer direction.
The practical effect is that easy tasks get less reasoning overhead and hard tasks get more, without requiring developers to tune a token ceiling. For most workloads this is the right default. For teams that used tight thinking budgets to control cost on high-volume workflows, it is a behavioral change that requires testing.
Hex's CTO noted that "low-effort Opus 4.7 is roughly equivalent to medium-effort Opus 4.6," meaning the entire capability curve has shifted upward. The new xhigh effort level (slotted between the old high and max) scores 71% at 100k tokens on internal benchmarks, already ahead of Opus 4.6's max at 200k tokens.
Vision: 3x resolution increase
Opus 4.7 processes images up to 2,576 pixels on their longest edge, roughly 3.75 megapixels, compared to approximately 1.15 megapixels previously. This is an architecture-level change with no API toggle required. Every image sent to the API is now processed at full resolution.
The impact on computer-use agents is substantial. On XBOW visual acuity tests, the model jumped from 54.5% to 98.5% success rate. Desktop automation agents that previously failed on 4K interfaces or dense PDF forms now have sufficient fidelity to operate accurately. Computer-use coordinates map 1:1 with actual pixels, eliminating the scale-factor translation code previously required for agent navigation on high-DPI displays.
Proactive output verification
This behavioral change matters more to day-to-day agentic coding than any benchmark number. Opus 4.7 proactively verifies outputs before declaring a task complete. In coding contexts, this means the model writes tests, runs them, and fixes failures before surfacing results rather than presenting output and waiting to be told it failed.
Notion flagged Opus 4.7 as the first model to pass their implicit-need tests: tasks where the model must infer required tools and actions without being told explicitly. Vercel reported the model runs proofs on systems code before starting work. Stripe noted it catches its own logical faults during planning. Each report describes the same behavioral shift: the model is now a participant in QA, not just a code generator.
Devin described the model as working "coherently for hours" on problems that previous Opus versions would quit on. For an autonomous software engineering agent, multi-hour task coherence without context drift is a foundational capability, not a nice-to-have.
Task Budgets (API beta)
The API introduces task budgets in public beta via the task-budgets-2026-03-13 beta header. This allows developers to set a hard token ceiling on agentic loops. The model sees a running countdown and prioritizes work to finish gracefully as the budget is consumed rather than cutting off mid-task or generating an unexpected bill.
For teams running autonomous agents on variable-complexity tasks, this is the cost management feature that makes production deployment predictable. Without a budget ceiling, a complex debugging session on a large codebase can generate 10x more tokens than a simple refactor, with no reliable way to bound the cost in advance.
1M-token context (beta) and Claude Code improvements
The 1 million token context window ships in beta. The standard 200k context handles roughly 150,000 words; the 1M context handles the equivalent of a full medium-sized codebase or a year of email archives in a single context.
Claude Code receives two notable updates: /ultrareview spins up a dedicated review session that flags bugs, design issues, and structural problems in structured output (Pro and Max users received 3 free ultrareviews at launch). Auto Mode extends to Max plan users, using a classifier to determine when to run safe actions without interruption and when to prompt for approval. For developers running long agentic sessions, the difference between approving 47 tool calls and approving 4 is the practical difference between supervising Claude and letting it work.
The tokenizer change: what it actually costs
The most practically consequential change for existing users appears nowhere in the feature list. The new tokenizer maps the same input text to 1.0-1.35x more tokens than Opus 4.6, at the same sticker price of $5 input / $25 output per million tokens.
For Claude Pro users on message-limited plans, this means hitting daily caps faster on identical prompts. For API teams on volume billing, it is an effective cost increase that does not appear in pricing announcements.
The offset argument is workload-dependent. Box's evaluations found a 56% reduction in model calls and a 50% reduction in tool calls compared to Opus 4.6 on certain agentic workflows, meaning the model completes complex tasks in fewer steps. If your primary workload involves complex multi-step tasks where Opus 4.6 required multiple retries or tool-call loops, the efficiency gains plausibly offset the tokenizer increase. If your workload is high-volume, short-context requests where Opus 4.6 already succeeded on the first pass, the tokenizer change is a straight cost increase with no corresponding benefit.
The recommendation from Anthropic and the developer community is consistent: measure token usage on your actual traffic before migrating at scale. Do not extrapolate from benchmark numbers.
The three breaking API changes
Teams migrating from Opus 4.6 need to address three breaking changes before putting 4.7 into production.
Extended thinking budgets removed. The thinking.budget_tokens parameter is deprecated. Adaptive Thinking handles reasoning allocation automatically. For teams that used tight budgets for cost control, test your actual cost per request under the new defaults.
Temperature, top_p, and top_k removed. These sampling parameters no longer function. Use prompting instead: instruct the model directly to be more or less precise, creative, or conservative rather than adjusting sampling configuration.
Thinking content empty by default. Opus 4.6 surfaced chain-of-thought content in responses by default. Opus 4.7 hides it unless you opt in with display parameters. Any eval harness, logging pipeline, or downstream tooling that parses thinking content needs to be updated before migration.
The migration is manageable but not trivial. Teams with tuned eval harnesses, prompt libraries optimized for Opus 4.6 behavior, or downstream tooling built around specific output formats should allow real testing time before the switch.
Benchmark comparison: Opus 4.7 vs. GPT-5.4 vs. Gemini 3.1 Pro
| Benchmark | Claude Opus 4.7 | GPT-5.4 | Gemini 3.1 Pro |
|---|---|---|---|
| SWE-bench Pro | 64.3% | 57.7% | 54.2% |
| SWE-bench Verified | 87.6% | ~82% | 80.6% |
| CursorBench | 70% | n/a | n/a |
| BrowseComp | 79.3% | 89.3% | 85.9% |
| GPQA Diamond | 94.2% | 94.4% | 94.3% |
| MCP-Atlas | 77.3% | n/a | 73.9% |
| Finance Agent | 64.4% | n/a | 59.7% |
| MMMLU (multilingual) | 91.5% | n/a | 92.6% |
| Input price / 1M tokens | $5.00 | $2.50 | ~$3.00 |
| Output price / 1M tokens | $25.00 | $15.00 | ~$15.00 |
| Context window | 1M (beta) | ~1.05M | 2M |
The pattern is consistent. Opus 4.7 leads on coding, agent, and document-analysis benchmarks. GPT-5.4 leads on web research (BrowseComp 89.3% vs 79.3%) and carries the lowest price for short-to-medium prompts. Gemini 3.1 Pro is approximately 60% cheaper across the board, leads on multilingual tasks (MMMLU 92.6% vs 91.5%), and offers the largest published context window at 2M tokens.
Graduate-level reasoning (GPQA Diamond) has saturated at the frontier. All three models sit within noise at approximately 94%. Differentiation has shifted to task-specific metrics. On the coding tasks that matter to the developers this model is designed for, Opus 4.7 leads by a meaningful margin.
The cost gap is the clearest challenge for Opus 4.7's value case. A workload that costs $50/day on Opus 4.7 costs approximately $17.50 on GPT-5.4 for short prompts and under $20 on Gemini 3.1 Pro. Most production teams have responded to this by routing: Opus 4.7 for maximum-quality coding tasks, Sonnet 4.6 or Haiku for volume work.
Where open-source fits
DeepSeek V3.2 and open-weights models are now within 30 Elo of the frontier on general tasks. Qwen3.6 Plus (Alibaba, April 2026) scored 78.8% and Muse Spark (Meta) scored 77.4% on SWE-bench Verified, closing the open-weights gap to single digits. At scale the cost differential is stark: a workload costing $50/day on Opus 4.7 costs approximately $2.80 on Gemini 3.1 Pro Flash-tier or under $1 on a self-hosted open-weights model. For teams with the infrastructure to self-host, the TCO break-even against cloud frontier models has fallen from 18 months to approximately 4-6 months.
What developers actually reported
The partner quotes in Anthropic's launch post were uniformly positive across a broad set of real engineering tools: Cursor, Rakuten, CodeRabbit, Warp, Devin, Vercel, Stripe, Notion, Linear, Bolt, Hex. The performance claims are specific and credible: Cursor's CEO reported the CursorBench number directly. Rakuten's 3x production task resolution comes from their own internal benchmark. CodeRabbit's +10% recall with stable precision is a real-workload measurement.
The developer community split more sharply. Within the first week of release, reports circulated on Reddit, X, and Hacker News describing degraded instruction-following on multi-file edits, broken agentic behavior in Claude Code sessions, and a perception that the model was refusing more requests or adding hedging language where previous Claude versions executed directly. One Hacker News commenter, writing about Opus 4.8 but reflecting on 4.7, described genuine difficulty identifying capability improvements over Opus 4.5 despite the benchmark claims.
The split is explicable and both sides are accurate. Partner benchmark data reflects specific, well-defined workloads where Opus 4.7's improvements are cleanest. Community reports reflect heterogeneous workflows where the new model's stricter literal-interpretation defaults and changed safety thresholds create friction. These are different workload types producing different experiences, not contradictory evidence.
One process criticism is valid regardless of model quality: Anthropic published no release notes, changelog entry, or public acknowledgment of behavioral changes for Opus 4.7. Developers had no way to distinguish a model regression from a safety classifier update from normal model variation. This made the community response harder to navigate and represents a communication gap worth noting for teams managing prompts tuned against specific model behavior.
Opus 4.7 vs. Opus 4.6: the migration decision
For coding-heavy teams doing complex agentic work: the capability gains are real and cross-validated on partner production data. The 10.9-point gain on SWE-bench Pro, proactive output verification, the vision resolution upgrade, and the xhigh effort level represent a meaningful step. If your workload resembles what Cursor, Rakuten, Warp, and Devin describe, including multi-step autonomous coding, complex debugging, and high-DPI computer use: the upgrade pays off in fewer retries and less human review time.
For teams with well-tuned Opus 4.6 workflows on shorter, structured tasks: the migration cost is real and the benefit is less clear. Three breaking changes require retesting. The tokenizer change may increase costs without a workload benefit. The instruction-following regression reports are most likely to affect structured, rule-heavy prompts rather than open-ended reasoning.
The practical migration approach: canary Opus 4.7 on a slice of production traffic, measure token usage against Opus 4.6 baselines, and audit every API call using the three deprecated parameters before full cutover.
One additional consideration: Opus 4.8 shipped on May 28, 2026, six weeks after 4.7, at the same $5/$25 pricing with a focus on reliability and code quality. Opus 4.7 remains available with no published retirement date, but teams evaluating the Opus tier fresh today are looking at 4.8, not 4.7.
Pros and cons
Pros
- Leads all publicly available models on SWE-bench Pro (64.3%) and SWE-bench Verified (87.6%), with gains cross-validated on real production workloads by independent partners at Cursor, Rakuten, CodeRabbit, and Warp
- Proactive output verification is a behavioral shift that materially reduces human review cycles in agentic coding: the model writes tests, runs them, and fixes failures before surfacing results
- 3x vision resolution increase (1.15MP to 3.75MP) enables computer-use agents to operate accurately on 4K interfaces and dense PDF forms for the first time
- Task Budgets beta provides hard token ceilings on agentic loops, making production cost predictable for variable-complexity autonomous tasks
- Rapid release cadence (four Opus updates in 12 months at the same price) means capability compounds without pricing escalation on existing integrations
Cons
- Three breaking API changes from Opus 4.6 require auditing and retesting any integration using thinking budgets, sampling parameters, or thinking content in downstream tooling
- Tokenizer change inflates billable token counts by up to 35% for identical input text at the same sticker price. The effective cost increase is hidden in usage, not visible in pricing
- Meaningfully more expensive than GPT-5.4 ($5 vs $2.50 input) and Gemini 3.1 Pro (~60% cheaper) for general workloads where the coding benchmark gap is less relevant
- Community-reported instruction-following regressions and safety over-caution on structured prompts are not reflected in benchmark scores and represent a real developer experience gap
- BrowseComp performance (79.3%) lags GPT-5.4 (89.3%) and Gemini 3.1 Pro (85.9%). Not the right model for web-research-heavy workloads
- Superseded by Opus 4.8 six weeks after launch; teams evaluating the Opus tier now should start with 4.8
Bottom line
Opus 4.7 is the best publicly available coding model at launch, with genuine benchmark gains cross-validated on partner production data. The proactive verification behavior and vision upgrade are real capability improvements, not benchmark artifacts.
The value rating (3.8 out of 5) reflects two specific problems: the tokenizer change that obscures a real effective cost increase, and a price premium that requires justification against workloads that genuinely benefit from best-in-class coding performance. For teams running complex agentic coding, the premium is justifiable. For general-purpose AI workloads, GPT-5.4 and Gemini 3.1 Pro offer competitive quality at meaningfully lower cost.
For teams evaluating the Opus tier today: test Opus 4.8. For teams currently on Opus 4.6 doing complex agentic work: the migration is worth the effort. For teams on structured, rule-heavy workflows: measure the tokenizer impact and test against the three breaking changes before committing.
For broader context on how Claude fits into agentic development stacks, see the Cursor 3 review and the open-source AI coding tools comparison on Bytewaves.
Frequently asked questions
Claude Opus 4.7 scores 87.6% on SWE-bench Verified and 64.3% on SWE-bench Pro. The Verified score made it the first model to cross 87% on that benchmark at launch. SWE-bench Pro is the more meaningful number for evaluating agentic coding performance: it uses multi-file, multi-language repositories with no public ground truth leakage, making it harder to optimize for specifically. The 10.9-point Pro improvement from Opus 4.6 (53.4% to 64.3%) is the largest single-release gain Anthropic has posted on that benchmark and holds after memorization screens were applied.
Three breaking API changes: (1) Extended thinking budgets removed: the thinking.budget_tokens
parameter is deprecated, replaced by Adaptive Thinking where the model allocates reasoning depth
automatically. (2) Temperature, top_p, and top_k parameters removed. Use prompting to guide model
behavior instead of sampling parameters. (3) Thinking content empty by default: opt in with
display parameters if your pipeline parses chain-of-thought output. Any eval harness, logging
pipeline, or downstream tooling built on Opus 4.6-specific parameter behavior needs to be audited
before migrating.
Yes, potentially by up to 35% for identical input text at the same sticker price. The offset is workload-dependent: Box found 56% fewer model calls and 50% fewer tool calls on certain agentic workflows, which can offset the higher per-token count for complex multi-step tasks. For high-volume, short-context requests where Opus 4.6 already succeeded on first attempt, the tokenizer change is a straight cost increase with no benefit. Measure actual token usage on your real traffic before migrating at scale.
Opus 4.7 leads on coding: 64.3% vs 57.7% on SWE-bench Pro, 87.6% vs ~82% on SWE-bench Verified. GPT-5.4 leads on web research: 89.3% vs 79.3% on BrowseComp. Both score approximately 94% on GPQA Diamond (statistical tie). GPT-5.4 is substantially cheaper: $2.50 input vs $5 for Opus 4.7, $15 output vs $25. For coding-first and agentic workloads, Opus 4.7 leads. For web-research-heavy or cost-sensitive pipelines, GPT-5.4's price advantage is difficult to overlook.
For teams evaluating the Opus tier fresh, start with Opus 4.8, which shipped May 28, 2026, at the same $5/$25 pricing with reliability improvements and better code quality defaults. Opus 4.7 remains available with no published retirement date, but there is no reason to target it as a migration destination when 4.8 is available. For teams currently on Opus 4.6 doing complex agentic coding work, the upgrade to either 4.7 or 4.8 is worth the migration cost. For structured, rule-heavy workflows, test the tokenizer impact and breaking API changes before committing to either.


