Meta Muse Spark: What Developers Need to Know (2026)
Muse Spark is Meta's first proprietary frontier model (April 2026): benchmarks, reasoning modes, no public API yet, and what it means for Llama developers.

Meta spent nine months telling the world it was rebuilding from scratch. On April 8, 2026, the bill came due - and what arrived was Muse Spark, the first model from Meta Superintelligence Labs.
The short version: this is Meta's most capable model, it's closed source, and there's no public API yet. If you were hoping to build on it today, you can't. If you want to understand what it actually does and what the closed-source move means for your stack - keep reading.
TL;DR: Muse Spark is Meta's first proprietary frontier model, scoring 52 on the AI Intelligence Index with strong multimodal and health benchmarks. No public API exists as of June 2026. Llama 4 weights remain available, but Meta's best development has moved to the proprietary Muse series.
What Muse Spark actually is
Muse Spark is the first model from Meta Superintelligence Labs (MSL), the research division Meta built around Alexandr Wang after a $14.3 billion deal to bring him on board from Scale AI. The model was internally codenamed "Avocado" and represents a full ground-up rebuild of Meta's AI stack - new architecture, new infrastructure, new data pipelines - completed in nine months.
It's natively multimodal, meaning it processes text, images, and audio in a single architecture rather than bolting a vision module onto a text model. It supports tool use, visual chain-of-thought, and multi-agent orchestration out of the box.
The benchmark headline: a score of 52 on the Artificial Analysis Intelligence Index v4.0, placing it fifth overall behind GPT-5.4 (57), Gemini 3.1 Pro (57), Claude Opus 4.6 (53), and Claude Mythos Preview. For a model built by a team that didn't exist a year ago, that's a real result.

The three reasoning modes
Muse Spark ships with three distinct interaction modes, and they're not just marketing labels - they correspond to meaningfully different inference behaviors.
Instant mode
Fast completions with no deliberation overhead. This is what powers the conversational surface inside Meta AI apps. If you're asking a factual question or need a quick summary, this is the path. Think of it as the equivalent of OpenAI's standard completions endpoint - low latency, no extra compute.
Thinking mode
The model pauses before responding, works through the problem step by step, and then produces its answer. Useful for math, code review, and multi-step analysis. Meta's benchmarks show meaningful accuracy gains on reasoning tasks compared to Instant mode, though the exact compute overhead hasn't been published.
Contemplating mode
This is the notable one. Contemplating mode spawns multiple agents that reason in parallel, then synthesizes their outputs. Meta says it scores 58% on Humanity's Last Exam with tools enabled and 38% on FrontierScience Research - results that compete with Gemini Deep Think and GPT Pro modes. Contemplating mode is rolling out gradually on meta.ai and wasn't available at general launch.
Whether these modes will be individually addressable via API parameters - the way Anthropic exposes extended thinking budgets or OpenAI exposes reasoning effort levels - hasn't been stated yet.

Where Muse Spark is strong, where it isn't
Meta was unusually candid in the launch post about where the model falls short. That's worth taking seriously, because it's the kind of transparency that makes benchmark claims more credible.
Strong:
- Multimodal perception - tops CharXiv Reasoning (86.4) and ZeroBench (33.0); ranks second on visual STEM and entity recognition tasks
- Health-informed responses - the best result on HealthBench Hard (42.8) across all frontier models, developed with reported input from over 1,000 physicians
- Efficiency - delivers frontier-class reasoning in under half the output tokens that Claude Opus 4.6 and GPT-5.4 use on the same benchmarks
Weak:
- Coding - trails GPT-5.4 and Gemini on SWE-Bench and LiveCodeBench Pro; Meta explicitly listed coding workflows as a current performance gap
- Long-horizon agentic tasks - the GDPVal-AA Elo of 1444 puts it last among the top five frontier models; Terminal-Bench scores lag competitors
- Abstract visual reasoning - scores 42.5 on ARC AGI 2 versus Gemini's 76.5; a significant gap
For developers evaluating models for coding agents or complex agentic pipelines, those last two points matter more than the health benchmarks.
The open-source question
This is the part that actually changes the calculus for teams building on Meta's ecosystem.
Muse Spark is proprietary. That's a deliberate break from the Llama lineage that made Meta one of the most developer-friendly AI companies in the world. Llama 4 still exists, and the weights are still downloadable - nothing about the Muse Spark launch retroactively changes that. But Meta's most capable, most actively developed model is now closed.
Alexandr Wang said Meta "hopes to open-source future versions" of Muse Spark. That's not a commitment and there's no timeline. Given the commercial context - Meta is spending between $115 and $135 billion on AI capex in 2026 - treating the Muse series the way you'd treat Llama weights would run against the current incentive structure.
For teams who chose Llama specifically because it kept pace with commercial models and stayed deployable on-premises, the trajectory has shifted. Llama may continue to receive updates, but it no longer represents where Meta's best work goes.
Developer access: where things stand now
As of June 2026, Muse Spark is available in two places:
- Meta AI consumer apps - meta.ai, the Meta AI mobile app, and rolling out across WhatsApp, Instagram, Facebook, Messenger, and Ray-Ban AI glasses
- Private API preview - available to unspecified "select partners" only
There is no public API, no announced pricing, no published endpoint format, and no stated rate limits. Meta has confirmed plans to eventually offer paid API access to a wider audience, but has given no timeline.
If you need a model you can build production integrations against today, Muse Spark isn't that yet. The practical alternative for multimodal reasoning remains the Claude, Gemini, and GPT-5.4 APIs - all of which accept image input and have published pricing and SLAs.
What this means for your stack
The Muse Spark launch is less a developer story right now and more a signal about where the frontier is moving. Three things are worth tracking:
Efficiency benchmarks are getting competitive. Meta claims to produce frontier-class outputs at a fraction of the token cost of other top models. If that holds under independent testing, it'll put pressure on per-token pricing across the market. Cheaper inference benefits everyone building at scale.
Agentic mode design is converging. Contemplating mode, Gemini Deep Think, GPT Pro, and Anthropic's extended thinking all point in the same direction: multi-step, parallel reasoning is becoming the expected interface for hard problems. If you're designing agent pipelines now, build for reasoning-effort parameters - they're becoming standard.
The open-weight tier is weakening. Llama was a serious contender for local deployment because it tracked commercial models closely. That advantage is narrowing. If your architecture depends on open weights for cost, compliance, or privacy reasons, start pressure-testing your Llama 4 assumptions against a world where the open-weight alternative consistently lags the API frontier.
For a broader look at how agentic reasoning is reshaping developer tools right now, our breakdown of guardian agents in CI/CD pipelines covers what these reasoning improvements mean in practice. If you're evaluating AI coding tools in the context of this shift, the Cursor 3 review covers how parallel agents are already changing the workflow. More model and tooling coverage lives in our AI tools category.
Why it matters now
Meta isn't competing with OpenAI and Anthropic the same way those companies compete with each other. OpenAI and Anthropic sell access to developers and enterprises. Meta deploys directly to over three billion people already inside its apps every day.
If Muse Spark performs at the level its benchmarks suggest, the next stage of competition won't be purely about benchmark rankings. It'll be about which platform can connect AI to the most relevant user context and the widest real-world distribution. Meta has an argument to make on that front that no other AI lab can make.
For developers, the practical horizon is: watch the API rollout. When public access opens with published pricing and a stable endpoint, Muse Spark's multimodal and health capabilities will be worth evaluating seriously - particularly for consumer applications that run inside Meta's ecosystem.
The takeaway
Muse Spark is the most capable model Meta has shipped, built from scratch in nine months by a team that didn't exist before Alexandr Wang arrived. The benchmarks are real, the efficiency story is credible, and the distribution advantage is unlike anything a pure API company can match.
The gap for developers is real too: no public API, acknowledged weaknesses in coding and agentic tasks, and an open-source trajectory that's no longer guaranteed. Watch the API rollout and treat the Llama open-weight tier as a separate, slower-moving track from here.
Frequently asked questions
Not publicly yet, as of June 2026. Meta is offering private API preview access to select partners only. A broader paid API rollout is planned, but no pricing or timeline has been announced. If you need a multimodal API today, Claude, Gemini, and GPT-5.4 all have public endpoints with published pricing.
No. Muse Spark is proprietary, which is a departure from Meta's Llama lineage. Alexandr Wang said Meta hopes to open-source future versions, but gave no timeline or commitment. Existing Llama 4 weights remain available and aren't affected by this launch.
Meta explicitly acknowledged two gaps: coding workflows and long-horizon agentic tasks. On SWE-Bench and Terminal-Bench, it trails GPT-5.4 and Gemini. Its GDPVal-AA Elo of 1444 places it last among the top five frontier models on agentic evaluations. For coding agents or complex pipelines, those gaps matter more than the strong health and multimodal results.
