Microsoft Copilot Wave 3 Review: Multi-Model Orchestration Tested

Microsoft announced Wave 3 of Microsoft 365 Copilot on March 9, 2026, and it is the most structurally significant update to the platform since launch. Not because of flashy new chat features, but because the underlying architecture changed: Copilot now runs OpenAI and Anthropic models simultaneously, routes work across them automatically, and executes multi-step tasks across Teams, Word, Excel, and SharePoint without requiring a prompt for each app.

That is a real shift. Whether it's a shift that matters for your organization depends on two things: how deep your M365 footprint is, and whether your data governance is actually in order. Get both right and Wave 3 delivers measurable productivity gains. Get either wrong and you will spend more time cleaning up than moving forward.

This review covers what Wave 3 actually introduced, what works, what doesn't, how it compares to Google Gemini and Salesforce Einstein, and who should upgrade now versus wait for broader rollout.

TL;DR: Copilot Wave 3 is the strongest version of Microsoft's enterprise AI platform to date. The multi-model architecture (GPT + Claude running Critique) is technically compelling, and Cowork represents a genuine step toward AI-driven work delegation. The blockers are real: many features require the new E7 tier or Frontier Program access, Anthropic models are disabled by default in Europe, and deployments stall without solid data governance upfront. If your M365 estate is clean and you're on E5 or higher, the upgrade case is strong. Everyone else should wait for H2 2026 general availability.

What changed in Wave 3 (and why it matters)

Every prior wave of Copilot was reactive: you prompt, Copilot responds. Wave 3 inverts that. You define a goal; Copilot builds a plan, identifies which M365 resources it needs, shows you the plan for approval, then executes it autonomously across apps.

That is a meaningful architectural change, not a marketing reframe. It requires Copilot to maintain state across a multi-step workflow, coordinate between apps, and make decisions about which resources to touch. That is what the term "execution engine" refers to in Microsoft's own positioning: Copilot has moved from a fast typist to something closer to a delegated junior analyst.

The other structural change is the multi-model layer. Copilot now runs GPT-5.1 and Anthropic Claude on the same task. In the Critique feature, GPT drafts and Claude validates. In the Model Council feature, both produce independent responses and a third model synthesizes them. Microsoft VP Jared Spataro framed the motivation plainly at the Frontier Transformation event in March 2026: "Every 60 days at least, there's a new king of the hill. There's so much demand for a platform that doesn't feel like, 'I have to skip over to the next vendor.'"

The bet is that organizational context, governance, and workflow integration matter more than which single model is currently best. That bet is directionally correct. Whether the execution justifies the premium is what this review tests.

Figure: Microsoft Copilot Cowork interface showing an autonomous task plan with SharePoint, Teams, and Planner steps before user approval. Cowork presents its execution plan for review before taking any action. Each step lists which M365 resource it will access.

The three flagship Wave 3 features

Copilot Cowork: autonomous task execution

Cowork is the headline feature and the one most likely to change how you think about AI in your organization.

You give Copilot a goal ("prepare for my Monday board meeting on Q2 performance"), and it builds a step-by-step plan: pull relevant emails from Outlook, retrieve Q2 reports from SharePoint, gather status updates from Teams channels, access project timelines from Planner, draft a briefing document in Word, and build a supporting deck in PowerPoint. It shows you the plan before executing, and presents checkpoints as it works.

Everything stays inside M365 security and governance boundaries. Every action is auditable through Purview. That is important: Cowork's value proposition is not just automation, it's automation within your existing compliance perimeter.
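
The plan, approve, execute flow described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Microsoft's implementation: `PlanStep`, `RunResult`, `run_with_approval`, and the audit log format are all hypothetical names, and the steps simply mirror the board-meeting example.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class PlanStep:
    """One step in a hypothetical Cowork-style plan."""
    resource: str  # which app the step touches, e.g. "SharePoint"
    action: str    # human-readable description of the work


@dataclass
class RunResult:
    executed: List[str] = field(default_factory=list)
    audit_log: List[str] = field(default_factory=list)


def run_with_approval(
    goal: str,
    plan: List[PlanStep],
    approve: Callable[[List[PlanStep]], bool],
) -> RunResult:
    """Show the whole plan first; execute only if the reviewer approves."""
    result = RunResult()
    result.audit_log.append(f"PLAN_PROPOSED goal={goal!r} steps={len(plan)}")
    if not approve(plan):
        result.audit_log.append("PLAN_REJECTED")
        return result  # nothing was touched
    result.audit_log.append("PLAN_APPROVED")
    for index, step in enumerate(plan, start=1):
        # A real system would call the app here; this sketch only records it.
        result.executed.append(f"{step.resource}: {step.action}")
        result.audit_log.append(f"STEP_DONE {index}/{len(plan)} {step.resource}")
    return result


board_prep = [
    PlanStep("Outlook", "pull relevant emails"),
    PlanStep("SharePoint", "retrieve Q2 reports"),
    PlanStep("Teams", "gather status updates"),
    PlanStep("Planner", "access project timelines"),
    PlanStep("Word", "draft briefing document"),
    PlanStep("PowerPoint", "build supporting deck"),
]

approved = run_with_approval("Monday board meeting prep", board_prep, lambda plan: True)
rejected = run_with_approval("Monday board meeting prep", board_prep, lambda plan: False)
```

The design point is that the approval callback runs before any step does, and every state change lands in the log. A rejected plan leaves an audit trail but executes nothing.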

Cowork is built on Anthropic's Claude technology and is currently available to Frontier Program participants. Broader general availability is expected through H2 2026 for E7 and Copilot+ licensed customers.

In practice, Cowork's quality depends heavily on Work IQ, which is Microsoft's organizational intelligence layer. Work IQ connects to every Microsoft Graph signal: documents, meetings, emails, Teams messages, calendar history, and collaboration patterns. A well-indexed M365 environment produces genuinely useful Cowork plans. A disorganized SharePoint with permissioning chaos produces outputs you can't trust.

Copilot Critique: dual-model research validation

Critique is the most practically useful feature in Wave 3 for knowledge workers who use Researcher today.

The mechanism is straightforward: one model (typically GPT) drafts the research response; a second model (typically Claude) independently reviews it for accuracy, completeness, and citation quality before the output reaches you. Microsoft calls it the "dual brain" model, and the third-party benchmark data supports the claim. Critique scored 13.8% higher than single-model approaches on the DRACO benchmark, with a composite score of 57.4, the highest recorded for any research tool at the time of announcement.

Critique is now the default when you select "Auto" in Researcher. You don't need to configure anything. The practical result: Researcher outputs are more accurate, citations are better checked, and the hallucination rate drops meaningfully for complex research queries.

One honest caveat: Critique only works as advertised if GPT and Claude have genuinely different failure modes. If both models hallucinate in the same direction on the same question, the validation step adds little. Community commentary on Office Watch in April 2026 surfaced this precisely: "Multi-model orchestration is the right direction but the hard part isn't routing queries across models, it's knowing when to trust which model's output on the same question." That is a fair criticism. For most enterprise research tasks, the error profiles differ enough that Critique adds value. For highly specialized or obscure domains, verify outputs independently.
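
The draft-then-validate mechanism can be sketched as follows. This is a toy illustration, not how Critique is implemented: `critique`, `toy_drafter`, and `toy_reviewer` are hypothetical stand-ins, and the reviewer rule (reject anything without a citation) is an assumption chosen to keep the example small.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-ins: a drafter returns (claim, citation) pairs,
# and a reviewer returns True when it accepts a pair.
Claim = Tuple[str, str]
Drafter = Callable[[str], List[Claim]]
Reviewer = Callable[[Claim], bool]


def critique(question: str, draft: Drafter, review: Reviewer) -> Dict[str, List[Claim]]:
    """Draft with one model, then sort claims by the second model's verdict."""
    accepted: List[Claim] = []
    flagged: List[Claim] = []
    for claim in draft(question):
        (accepted if review(claim) else flagged).append(claim)
    return {"accepted": accepted, "flagged": flagged}


def toy_drafter(question: str) -> List[Claim]:
    # Canned output standing in for the drafting model.
    return [
        ("E7 is priced at $99/user/month", "pricing table"),
        ("Agent 365 costs $15/user/month", "pricing table"),
        ("Cowork runs outside Purview auditing", ""),  # arrives with no citation
    ]


def toy_reviewer(claim: Claim) -> bool:
    # Toy rule: reject any claim that arrives without a citation.
    _text, citation = claim
    return bool(citation)


report = critique("What does Wave 3 cost?", toy_drafter, toy_reviewer)
```

If the reviewer is swapped for one that accepts everything, `flagged` comes back empty. That is the correlated-failure case: a second pass only adds value when the two checks fail differently.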

Model Council: simultaneous multi-model analysis

Council is the opt-in mode inside Researcher for deeper comparative analysis. Both GPT and Claude independently produce a full response to the same question. A third model then synthesizes both, identifying where they agree (high confidence), where they diverge (areas needing human review), and unique insights each model contributed.

The legal team use case is the clearest illustration: a team researching regulatory implications uses Council to generate two independent analyses, then uses the divergence points to focus attorney review time rather than re-reading everything from scratch. The AI handles the first pass across a large document set; humans focus where the models disagreed.

Council is slower and more resource-intensive than Critique. It is not the right default for everyday queries. For high-stakes research, policy analysis, or decisions where being wrong is costly, it is worth the extra time.
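
The agree/diverge split that Council reports can be illustrated with plain set operations. This is a deliberately simplified sketch: the article describes a third model doing the synthesis, so exact string matching here is only a stand-in, and `synthesize` plus the sample findings are hypothetical.

```python
from typing import Dict, Set


def synthesize(findings_a: Set[str], findings_b: Set[str]) -> Dict[str, Set[str]]:
    """Split two independent sets of findings into agreement and divergence."""
    return {
        "agreed": findings_a & findings_b,  # both models reported it
        "only_a": findings_a - findings_b,  # unique to the first model
        "only_b": findings_b - findings_a,  # unique to the second model
    }


# Made-up findings for a regulatory research question.
gpt_findings = {
    "rule applies to EU subsidiaries",
    "reporting deadline is annual",
    "fines scale with revenue",
}
claude_findings = {
    "rule applies to EU subsidiaries",
    "reporting deadline is quarterly",
    "fines scale with revenue",
}

council = synthesize(gpt_findings, claude_findings)
needs_human_review = council["only_a"] | council["only_b"]
```

In this toy run the two models agree on scope and fines but disagree on the reporting deadline, so the deadline is the only item routed to a human reviewer.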

Figure: Copilot Researcher panel showing Critique mode active with the dual-model validation indicator and the Council synthesis view. Critique runs automatically in Researcher's Auto mode. Council requires opt-in and shows model agreement and divergence points explicitly.

Work IQ and Agent 365: the infrastructure underneath

Flagship features get the headlines, but the two infrastructure layers introduced in Wave 3 are what make the platform defensible long-term.

Work IQ: organizational context at scale

Work IQ connects all Copilot outputs to an organization's actual work history via Microsoft Graph. The practical difference: instead of a generic AI response, you get a response grounded in your organization's documents, your team's terminology, your previous decisions, and your collaboration patterns.

Three Work IQ capabilities ship with Wave 3. Work IQ Memory personalizes responses based on your own work and chat history over time. The Work IQ API (in public preview) exposes this intelligence layer to external developers via A2A, MCP, and REST interfaces, so custom agents can build on enterprise context without rebuilding data pipelines. Dataverse integration (also in preview) will extend Work IQ to Dynamics 365 and Power Apps operational data.

The strategic implication: the longer an organization uses M365 Copilot, the richer Work IQ's context becomes, and the more differentiated its outputs are from a generic AI model. This is the compound moat Microsoft is building. Bloomberg reportedly used Work IQ grounding to compress agent time-to-production from days to minutes on internal development tasks.

Agent 365: enterprise AI governance

Agent 365 reached general availability on May 1, 2026, at $15/user/month (also bundled in E7). It provides a centralized registry for every AI agent operating in a tenant: Copilot Studio agents, third-party marketplace agents, and Cowork workflows.

Access control runs through Entra ID, so agent permissions integrate with the identity infrastructure enterprises already manage. Real-time dashboards show what agents are doing across the tenant. Defender integration handles agent threat monitoring. A2A communication is supported for coordinated multi-agent workflows.

For organizations evaluating Wave 3 seriously, Agent 365 is not optional at scale. Without a registry, you accumulate shadow AI agents that no one can audit or revoke. With it, the governance overhead becomes manageable. Microsoft's phishing triage agent (built on Copilot Studio and connected to Defender data via M365 Agents SDK) delivered 6.5x efficiency gains for security analysts when deployed in a governed environment. That result does not happen in an ungoverned one.
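
To make the registry idea concrete, here is a toy in-memory version. It is not the Agent 365 API: `AgentRecord`, `AgentRegistry`, and the agent ids are hypothetical, and a real deployment would tie this to Entra ID rather than a Python dictionary.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class AgentRecord:
    agent_id: str
    owner: str    # accountable person or team
    source: str   # e.g. "Copilot Studio", "marketplace", "Cowork"
    active: bool = True


class AgentRegistry:
    """Toy in-memory registry illustrating register, revoke, and lookup."""

    def __init__(self) -> None:
        self._agents: Dict[str, AgentRecord] = {}

    def register(self, record: AgentRecord) -> None:
        if record.agent_id in self._agents:
            raise ValueError(f"duplicate agent id: {record.agent_id}")
        self._agents[record.agent_id] = record

    def revoke(self, agent_id: str) -> None:
        self._agents[agent_id].active = False

    def is_allowed(self, agent_id: str) -> bool:
        """Unregistered (shadow) agents and revoked agents are both denied."""
        record = self._agents.get(agent_id)
        return record is not None and record.active

    def active_agents(self) -> List[str]:
        return sorted(a.agent_id for a in self._agents.values() if a.active)


registry = AgentRegistry()
registry.register(AgentRecord("phishing-triage", "SecOps", "Copilot Studio"))
registry.register(AgentRecord("board-prep", "Chief of Staff", "Cowork"))
registry.revoke("board-prep")
```

The useful property is the default-deny lookup: an agent nobody registered fails `is_allowed` the same way a revoked one does, which is what makes shadow agents visible.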

Pricing: the E7 bundle and what you're actually paying

Wave 3's commercial model introduces a new pricing tier that simplifies licensing for some buyers and adds complexity for others.

Tier	Price	What's included
Microsoft 365 E3	~$36/user/month	Core M365 apps, limited Wave 3 features
Microsoft 365 E5	~$57/user/month	E3 + security, compliance tools
Copilot add-on	$30/user/month	Standard Copilot Chat, basic in-app Copilot
Agent 365 add-on	$15/user/month	Agent registry, governance, A2A
M365 E7 "Frontier Suite"	$99/user/month	E5 + Copilot + Agent 365 + Entra Suite + advanced Defender, Intune, Purview

The E7 math works in your favor if you are already holding M365 E5, standalone Copilot ($30/user/month), and Entra Suite licenses simultaneously. The consolidated SKU reduces friction and may reduce total cost. If you are on E3 with a standalone Copilot add-on and no Entra Suite, the jump to E7 is significant.
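
The arithmetic is worth running with the list prices from the table. The sketch below treats the approximate E3 and E5 figures as exact, and it leaves Entra Suite out of the sums because this review does not quote a standalone price for it. `PRICES` and `monthly_cost` are illustrative names.

```python
# Per-user monthly list prices from the pricing table.
# The E3 and E5 figures are approximate in the source ("~$36", "~$57").
PRICES = {"E3": 36, "E5": 57, "Copilot": 30, "Agent 365": 15, "E7": 99}


def monthly_cost(skus, seats=1):
    """Total monthly cost for a set of SKUs across a number of seats."""
    return sum(PRICES[sku] for sku in skus) * seats


e5_stack = monthly_cost(["E5", "Copilot", "Agent 365"])  # 102
e3_stack = monthly_cost(["E3", "Copilot"])               # 66
e7 = monthly_cost(["E7"])                                # 99

saving_vs_e5_stack = e5_stack - e7       # 3 per user, before any Entra Suite cost
jump_from_e3_stack = e7 - e3_stack       # 33 per user
jump_pct = jump_from_e3_stack / e3_stack # 0.5, i.e. a 50% increase
```

On these numbers, an E5 customer already paying for Copilot and Agent 365 comes out $3 per user ahead on E7 before Entra Suite is even counted. An E3 customer with only the Copilot add-on pays $33 more per user, or roughly $396,000 more per year across 1,000 seats.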

Important nuance: most of Wave 3's compelling features require E7 or active Frontier Program participation. Organizations on standard E3/E5 with legacy Copilot licenses receive a limited Wave 3 experience. Before drawing any cost-benefit conclusion, map exactly which features you actually need against which tier unlocks them.

Limitations worth knowing before you commit

Data governance is the prerequisite, not the nice-to-have

The most consistent barrier to Wave 3 value is organizations deploying before their Microsoft Graph permissions are clean. Cowork relies on Graph to access data across the tenant. Over 15% of business-critical files in a typical enterprise have over-provisioned permissions, according to industry research. 73% of enterprises discover critical data exposure risks after deploying Copilot, according to Copilot consulting specialists. In a tenant like that, Cowork can easily surface or act on data employees shouldn't see.

The fix requires a SharePoint permissions audit before Cowork deployment, Purview audit logging configured specifically for Cowork events, and an AI governance framework that accounts for autonomous agent actions. Governance-first deployments deliver roughly 3x better ROI outcomes in practice. The enterprises that stall on Copilot are almost always the ones that deployed first and tried to fix governance after the fact.
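
A permissions audit boils down to comparing who can read a file with who needs to. The sketch below shows that comparison on made-up data. `FileAcl`, `over_provisioned`, the paths, and the group names are all hypothetical. A real audit would pull access lists from SharePoint or Microsoft Graph rather than a hard-coded list.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass
class FileAcl:
    path: str
    business_critical: bool
    granted_to: Set[str]  # groups that can currently read the file
    needed_by: Set[str]   # groups that actually need access


def over_provisioned(files: List[FileAcl]) -> List[Tuple[str, List[str]]]:
    """Return business-critical files that grant access beyond what is needed."""
    findings = []
    for f in files:
        extra = f.granted_to - f.needed_by
        if f.business_critical and extra:
            findings.append((f.path, sorted(extra)))
    return findings


sample = [
    FileAcl("/finance/q2-forecast.xlsx", True, {"Finance", "All Employees"}, {"Finance"}),
    FileAcl("/hr/salary-bands.docx", True, {"HR"}, {"HR"}),
    FileAcl("/general/lunch-menu.docx", False, {"All Employees"}, {"Facilities"}),
]

findings = over_provisioned(sample)
exposure_rate = len(findings) / sum(f.business_critical for f in sample)
```

Anything in `findings` is a file an autonomous agent could legitimately read on behalf of a user who should never have had access, so it is the list to clear before enabling Cowork.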

The Frontier Program creates a two-tier experience

Many of Wave 3's best features, including Cowork, multi-model Chat with Claude, and Critique as the default in Researcher, are only available through the Frontier Program or the E7 tier. An organization on standard M365 licensing comparing Wave 3 marketing to their actual Copilot experience will find a gap. Staggered rollout continues through Q2 2026 and beyond, and the schedule varies by feature, app, and region.

Anthropic models are disabled by default in Europe

For European organizations, Anthropic models are currently disabled by default in Microsoft 365 due to GDPR data residency requirements. This directly limits the practical impact of Critique, Council, and Cowork for a substantial portion of the global enterprise market. Microsoft is working on compliance solutions, but no confirmed GA timeline for European multi-model parity exists as of June 2026.

Multi-model complexity creates real trust questions

When Model Council produces conflicting outputs on the same research question, users need enough AI literacy to interpret the divergence meaningfully. For organizations without strong internal AI Champions, the complexity can hinder rather than help. Community commentary is pointed on this: "Introducing Critique: because when one AI hallucinates, the solution is obviously to add more AIs to vote on which hallucination sounds most professional." That is a fair read of the risk. The feature works when the models disagree for substantive reasons; it is less useful when users cannot distinguish meaningful divergence from noise.

Adoption numbers tell a cautionary backstory

Wave 3 launched against a difficult backdrop. Paid Copilot adoption sits at approximately 3.3% of the 450 million commercial M365 seats, around 15 million paid subscribers. Copilot's accuracy Net Promoter Score hit -24.1 in September 2025. Only 8% of enterprise users prefer Copilot over competitors when given a free choice. Wave 3's technical improvements are Microsoft's most direct response to those numbers. Whether Critique's 13.8% accuracy improvement shifts NPS materially remains to be seen.
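
Two of these figures are easy to sanity-check. The sketch below confirms the subscriber estimate and shows what single-model baseline the Critique result implies. The review does not say whether "13.8% higher" is a relative gain or a gain in points, so both readings are computed, and neither should be taken as the published baseline.

```python
# Paid adoption: 3.3% of 450 million commercial seats.
commercial_seats = 450_000_000
paid_share = 0.033
paid_seats = round(commercial_seats * paid_share)  # 14,850,000, i.e. "around 15 million"

# Critique on DRACO: composite 57.4, reported as 13.8% above single-model runs.
critique_score = 57.4
gain = 13.8

# Reading 1 (assumption): 13.8% is a relative improvement over the baseline.
baseline_if_relative = critique_score / (1 + gain / 100)  # about 50.4

# Reading 2 (assumption): 13.8 is a gain in absolute points.
baseline_if_points = critique_score - gain  # about 43.6
```

Either way the single-model baseline sits somewhere between roughly 44 and 50 on the same scale, which is a real gain, but not one that by itself predicts how users will score accuracy.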

How Copilot Wave 3 compares to the alternatives

Copilot Wave 3 vs. Google Gemini for Workspace

Dimension	Copilot Wave 3	Gemini for Workspace
Underlying model	GPT-5.1 + Claude (multi-model)	Gemini 3 Pro (single provider)
GPQA Diamond benchmark	GPT-5.1 at 88.1%	Gemini 3.1 Pro at 94.3%
Agentic capability	Cowork (multi-step, cross-app)	Gemini agents (maturing)
Multi-model orchestration	Yes, OpenAI + Anthropic	No
Ecosystem depth	Teams, Outlook, Word, Excel, PPT, SharePoint	Gmail, Docs, Sheets, Slides, Meet
Governance tooling	Agent 365, Purview, Entra ID	Google Workspace Admin
Context window	128K to 1M tokens (model-dependent)	Up to 1M tokens
Enterprise pricing	$30/user/month to $99/user/month (E7)	~$30/user/month to $60/user/month

The verdict is straightforward: Copilot Wave 3 is the clear choice for M365-native organizations with existing Teams, Outlook, and SharePoint workflows. Gemini leads for Google-native organizations and on raw benchmark performance. Gemini surpassed Copilot in paid subscriber share in late 2025, driven largely by aggressive Workspace bundling. Neither platform bridges the other's ecosystem without significant custom development.

For more on how the agent communication standards used by both platforms differ, see our MCP vs A2A protocol comparison.

Copilot Wave 3 vs. Salesforce Einstein and Agentforce

The two platforms are not truly competitive in the same category. Einstein and Agentforce win in Salesforce-centric workflows: lead scoring, customer service automation, and CRM-native agent tasks. Copilot wins everywhere else in the M365 estate. Most enterprises run both. A2A communication now allows Copilot Studio agents and Salesforce Agentforce agents to delegate tasks to each other, which reduces the friction of operating across both ecosystems.

Copilot Wave 3 vs. Standalone Claude Enterprise

This comparison is genuinely ironic: Anthropic's Claude powers part of Copilot Wave 3's value proposition. For organizations that need the deepest document reasoning or coding capability on a standalone basis, Claude Enterprise may still outperform. For organizations that need AI deeply embedded in day-to-day M365 workflows with enterprise governance built in, Copilot Wave 3 (which includes Claude access) is the more practical choice. You don't need to choose between the two models; you get both through Wave 3.

Pros and cons

Pros

Multi-model orchestration (GPT + Claude running simultaneously) is technically differentiated and has no direct equivalent from any competitor in a productivity suite as of June 2026.
Critique's 13.8% DRACO benchmark improvement translates to meaningfully fewer hallucinations in Researcher, which matters for legal, compliance, and strategy teams.
Cowork's plan-then-execute model with human approval checkpoints is the right governance design for enterprise AI delegation: you retain oversight without losing the speed benefit.
Work IQ's organizational context compounds over time, making Copilot outputs progressively more differentiated from generic AI alternatives the longer you use it.
Agent 365 provides enterprise AI governance infrastructure that no other platform matches at this level of integration with identity and compliance tooling.
The E7 bundle simplifies licensing and can reduce total cost for organizations already holding E5 + Copilot + Entra Suite separately.

Cons

At 3.3% paid penetration across 450 million M365 seats, Copilot's pre-Wave 3 track record is a real risk signal: adoption challenges are structural, not just feature gaps, and Wave 3 does not eliminate the human change management problem.
Most compelling Wave 3 features require the Frontier Program or E7 tier; standard E3/E5 customers get a significantly reduced version of what the marketing describes.
Anthropic models are disabled by default in Europe due to GDPR constraints, which makes the multi-model value proposition unavailable to a large portion of enterprise customers until Microsoft resolves data residency compliance.
Governance prerequisites are non-negotiable: enterprises with messy SharePoint permissions, over-provisioned Graph access, or no Purview audit logging should not deploy Cowork until those are fixed, or risk data exposure at scale.
The E7 pricing at $99/user/month is a significant commitment; organizations not already on E5 face a steep cost jump to unlock the full Wave 3 feature set.

Who should deploy Wave 3 now

Deploy it if:

You are already on M365 E5 or E7 with reasonably clean SharePoint permissions and Purview logging configured.
You have knowledge workers who regularly do cross-app research synthesis (legal, strategy, finance, operations) and would benefit directly from Critique's accuracy improvements.
You are running 10+ Copilot Studio agents and need the governance infrastructure Agent 365 provides before agent count grows further.
Your organization is North American or APAC and can take full advantage of the Anthropic model integration without the European data residency constraint.

Wait if:

Your SharePoint permissions haven't been audited in the past 12 months. Cowork in a disorganized tenant is a data exposure risk before it's a productivity gain.
You're on M365 E3 with a standard Copilot add-on. The Wave 3 features you'll actually receive at that tier fall well short of what the marketing describes, and that won't change until broader GA rolls out through H2 2026.
You're an EMEA organization expecting multi-model Critique and Council capabilities. The Anthropic model availability question needs resolution before the core Wave 3 value proposition lands.
Your organization lacks internal AI Champions who can demonstrate workflows to non-technical employees. Wave 3 adds complexity; without adoption support, utilization stays low regardless of feature quality.
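
The deploy-or-wait criteria above can be turned into a simple checklist function. This is one possible encoding, not an official readiness tool: the `Tenant` fields, the region codes, and the `reasons_to_wait` name are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Tenant:
    license_tier: str                    # "E3", "E5", or "E7"
    months_since_permissions_audit: int
    purview_logging: bool
    region: str                          # e.g. "NA", "APAC", "EMEA"
    has_ai_champions: bool


def reasons_to_wait(tenant: Tenant) -> List[str]:
    """Return the blockers from the checklist; an empty list means deploy."""
    reasons = []
    if tenant.months_since_permissions_audit > 12:
        reasons.append("SharePoint permissions not audited in the past 12 months")
    if not tenant.purview_logging:
        reasons.append("Purview logging not configured")
    if tenant.license_tier == "E3":
        reasons.append("E3 tier receives a limited Wave 3 feature set")
    if tenant.region == "EMEA":
        reasons.append("Anthropic models disabled by default in Europe")
    if not tenant.has_ai_champions:
        reasons.append("no internal AI Champions to support adoption")
    return reasons


ready = Tenant("E5", 3, True, "NA", True)
not_ready = Tenant("E3", 18, False, "EMEA", False)
```

Returning the list of blockers rather than a yes/no answer makes the output usable as a remediation plan: each entry maps to one item in the "Wait if" list.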

Is it worth it?

Wave 3 is the first version of Microsoft 365 Copilot that is genuinely difficult to dismiss. Multi-model orchestration is real and demonstrably better for research accuracy. Cowork is the right design for enterprise AI delegation. Work IQ is a strategic moat that compounds over time. Agent 365 gives IT the governance layer it needs to let AI agents operate at scale without creating shadow risk.

The blockers are also real. Most enterprises will not experience the full Wave 3 feature set until H2 2026. European organizations face the Anthropic availability constraint. And the adoption data from Waves 1 and 2 is a reminder that technical capability does not automatically translate to organizational utilization.

For M365-native organizations with governance infrastructure already in place, the upgrade path is clear: get on the Frontier Program or move to E7 and start with Critique and Work IQ before deploying Cowork. For everyone else, the practical step is to run a governance readiness assessment now so you're ready when broader GA arrives.

The platform's Copilot Studio ecosystem is already showing what sustained investment looks like: 160,000 organizations created more than 400,000 custom agents in the first three months after launch. That signals strong enterprise appetite for AI automation beyond the general-purpose assistant. Wave 3 gives that appetite the governance rails and model quality it needs to scale.

Pair this read with our Lovable vs Bolt vs v0 comparison if you are also evaluating AI-native development platforms outside the M365 ecosystem.

Frequently asked questions

What is Microsoft Copilot Wave 3?

Wave 3 is Microsoft's third major update cycle for Microsoft 365 Copilot, announced March 9, 2026. It introduces multi-model AI orchestration (running OpenAI GPT and Anthropic Claude simultaneously), Copilot Cowork for autonomous multi-step task execution, Copilot Critique for dual-model research validation, the Model Council for parallel independent analysis, Work IQ for organizational context grounding, and Agent 365 for enterprise AI governance. Most flagship features require the new M365 E7 Frontier Suite or participation in Microsoft's Frontier Program.

How does Copilot Cowork work?

Cowork lets you give Copilot a goal instead of a single prompt. Copilot builds a step-by-step execution plan, identifies the M365 resources it needs (SharePoint, Teams, Planner, Outlook, Word, PowerPoint), shows you the plan for approval, and then executes it autonomously with progress checkpoints. All actions stay inside M365 security and governance boundaries and are auditable through Purview. Cowork is currently available through the Frontier Program and is built on Anthropic Claude technology.

Is Copilot Wave 3 available in Europe?

Wave 3 features are available in Europe, but Anthropic models are currently disabled by default due to GDPR data residency requirements. This means European organizations cannot use Critique (dual-model validation), Model Council, or the full Cowork capability until Microsoft ships a compliant data residency solution. The core Copilot Chat and in-app editing features work normally for European users, but the multi-model orchestration that defines Wave 3's differentiation is not yet fully available in the region.

What is M365 E7 and what does it cost?

M365 E7 launched May 1, 2026, at $99/user/month. It bundles M365 E5 (core apps plus security and compliance), Copilot, Agent 365 (enterprise AI governance), Microsoft Entra Suite (identity), advanced Defender, Intune, and Purview into a single SKU. For organizations already holding M365 E5 plus standalone Copilot ($30/user/month) and Entra Suite licenses, E7 can simplify licensing and potentially reduce total per-user cost. Organizations on M365 E3 without Entra Suite face a larger cost increase.

Does multi-model orchestration actually improve accuracy?

Yes, with a specific caveat. The Critique feature (GPT drafts, Claude validates) scored 13.8% higher than single-model approaches on the DRACO benchmark, with a composite accuracy score of 57.4. That is a material improvement for enterprise research tasks in legal, compliance, finance, and strategy contexts. The caveat: Critique delivers its full benefit when GPT and Claude have different failure modes on the same question. For highly specialized or obscure domain queries, verify outputs independently regardless of the dual-model setup.