Nvidia Cosmos 2.5: World Foundation Models for Physical AI Explained

Training a robot to pick up an unfamiliar object takes thousands of real-world attempts. Training an autonomous vehicle to handle rare edge cases, like a car drifting the wrong way on a highway ramp at night, can take years of footage and cost millions to collect safely. Nvidia's answer to both problems is the same: generate the world instead of driving through it.

Cosmos 2.5 is Nvidia's latest release in its World Foundation Models (WFMs) family. It ships two core models, Cosmos Predict 2.5 and Cosmos Transfer 2.5, that together let teams produce physically coherent synthetic video at a scale and quality that was impractical a year ago. Here is what each model actually does, how they fit together, and why the approach is becoming a standard part of physical AI pipelines.

TL;DR: Cosmos Predict 2.5 generates synthetic video worlds from text, images, or prior frames. Cosmos Transfer 2.5 transforms simulator output into photorealistic footage for training. Both are open-weight, available on Hugging Face, and post-trainable for robotics or autonomous vehicle tasks.

What "world foundation models" actually means

The phrase sounds abstract, so start with the problem it solves.

Language models learn patterns across text. Vision models learn patterns across images. World foundation models learn patterns across time: what happens next in a physical environment, given what is visible now. They are trained on massive video datasets and develop an internal model of how objects move, how light changes, and how actions cause consequences in three-dimensional space.

That internal model is what makes them useful for simulation. Instead of hand-building a physics engine and populating it with 3D assets, a team can prompt a world model to generate plausible footage of a specific scenario. The robot arm tries the grasp. The car encounters the wrong-way driver. The warehouse forklift rounds a blind corner. All in synthetic video that the AI policy can learn from, without anyone getting hurt or spending three months filming it.

How world foundation models feed physical AI pipelines

Real-world video data (35M hours raw)
        │
 Curation & filtering
        │
200M high-quality clips
        │
┌───────┴─────────┐
│  Cosmos Predict │  ← generates new world scenarios
│      2.5        │     (Text / Image / Video → Video)
└───────┬─────────┘
        │  synthetic video output
┌───────┴──────────┐
│ Cosmos Transfer  │  ← makes simulator output photorealistic
│      2.5         │     (Sim render → realistic video)
└───────┬──────────┘
        │
Robot / AV policy training
        │
 Real-world deployment

Cosmos 2.5 sits in the middle of that chain. It does not replace real data entirely. It multiplies it.

Cosmos Predict 2.5: one model instead of three

Cosmos Predict 2.5 merges what were previously three separate models, Text2World, Image2World, and Video2World, into a single unified architecture capable of generating consistent, controllable video worlds from multiple input modalities.

That consolidation matters practically. Before, a team building a robotics pipeline had to maintain separate checkpoints for each generation mode and manage the handoffs between them. With 2.5, a single model handles all three. You can prompt it with a text description, a single reference frame, or a short seed video, and it continues or extends the world from there.
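To make the consolidation concrete, here is a minimal sketch of how a pipeline might route all three input types through one entry point. The `WorldRequest` class, its fields, and `generation_mode` are hypothetical names chosen for illustration. They are not part of the Cosmos codebase, and the real inference scripts may structure their inputs differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class WorldRequest:
    """Hypothetical request object; not a Cosmos API type."""
    prompt: str
    image_path: Optional[str] = None  # single reference frame
    video_path: Optional[str] = None  # short seed clip


def generation_mode(req: WorldRequest) -> str:
    """Name the conditioning mode implied by the inputs that are present."""
    if req.image_path and req.video_path:
        raise ValueError("pass either a reference image or a seed video, not both")
    if req.video_path:
        return "video2world"
    if req.image_path:
        return "image2world"
    return "text2world"


print(generation_mode(WorldRequest("a forklift rounds a blind corner")))
print(generation_mode(WorldRequest("continue the scene", video_path="seed.mp4")))
```

With a single checkpoint behind that entry point, the mode becomes a property of the request rather than a choice of model.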

The training data and architecture

The flow-based model was trained on 200 million high-quality video clips curated from a pipeline that processes 35 million hours of raw video and produces over 6 billion clips before filtering. The filtering step is what matters most here: raw internet video contains plenty of physically implausible motion, compression artifacts, and discontinuous cuts. The curated subset is what teaches the model what real physical causality looks like.
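A quick back-of-the-envelope check shows how aggressive that filtering is. The three input figures come from the paragraph above. The average clip length is only a rough upper bound, and it assumes the 6 billion clips are non-overlapping and cover all of the raw footage, which the article does not state.

```python
RAW_HOURS = 35_000_000
CLIPS_BEFORE_FILTER = 6_000_000_000  # "over 6 billion", so treat as a lower bound
CLIPS_AFTER_FILTER = 200_000_000

# Assumption: clips tile the raw footage without overlap.
avg_clip_seconds = RAW_HOURS * 3600 / CLIPS_BEFORE_FILTER
retention = CLIPS_AFTER_FILTER / CLIPS_BEFORE_FILTER

print(f"{avg_clip_seconds:.1f} s average clip, {retention:.1%} retained")
# prints "21.0 s average clip, 3.3% retained"
```

In other words, roughly one clip in thirty survives curation.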

Cosmos Predict 2.5 uses Cosmos Reason 1, a physical AI vision-language model, as its text encoder, which gives it richer text grounding and finer control over world simulation. After pretraining on the curated clips, the model is refined with reinforcement learning-based post-training. Nvidia reports substantial improvements over Cosmos Predict 1 in video quality and instruction alignment.

The reinforcement learning step is newer and worth noting. Rather than relying only on the generative pretraining objective, Nvidia added an RL-based post-training stage to improve how well the generated video follows the text prompt. The result is tighter alignment between what you ask for and what you get.

What developers can generate

Cosmos Predict 2.5 produces sequences of up to 30 seconds while maintaining spatial-temporal coherence, which matters for simulation, long-horizon prediction, and robotic planning. It can also generate synchronized camera views, with camera control, for the multi-camera setups used in autonomous vehicle training and robot vision.

The 30-second horizon is significant. Most diffusion-based video models top out at a few seconds before temporal drift accumulates and the scene falls apart. Maintaining coherence across 30 seconds at 1280x720 and 16 fps means long maneuver sequences, multi-step manipulation tasks, and extended AV driving scenarios are all tractable from one generation call.
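To put those numbers in perspective, the arithmetic below works out how many frames one such generation contains and how large it would be as uncompressed 8-bit RGB. The duration, frame rate, and resolution are from the paragraph above. The 3-bytes-per-pixel figure is an assumption about the decoded output format, not something the article specifies.

```python
DURATION_S = 30
FPS = 16
WIDTH, HEIGHT = 1280, 720
BYTES_PER_PIXEL = 3  # assumption: 8-bit RGB after decoding

frames = DURATION_S * FPS
pixels_per_frame = WIDTH * HEIGHT
raw_bytes = frames * pixels_per_frame * BYTES_PER_PIXEL

print(frames)                      # 480
print(pixels_per_frame)            # 921600
print(f"{raw_bytes / 1e9:.2f} GB") # 1.33 GB
```

Every one of those 480 frames has to stay consistent with the ones before it, which is why temporal drift is the limiting factor for long generations.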

Models are released at 2B and 14B parameter scales under the NVIDIA Open Model License. Partners including 1X, Figure AI, Agility Robotics, Uber, and Waabi are already using Cosmos models for synthetic data generation and policy evaluation.

Cosmos Transfer 2.5: turning simulator output into training data

Predict 2.5 generates scenarios from scratch. Transfer 2.5 solves a different, equally painful problem: the sim-to-real gap.

Modern simulators like Nvidia Isaac Sim or CARLA can render physically accurate robot and vehicle environments, but they still look synthetic. Neural networks trained on simulator footage struggle to generalize to real camera footage because the visual distribution is different. Transfer 2.5 bridges that gap by taking a simulator render and generating a photorealistic version of it, while keeping the underlying geometry, motion, and labels intact.

How the sim-to-real translation works

CosmosWriter captures synchronized RGB, depth, segmentation, and edge data from a robot navigating a simulated environment. The captured data serves as ground-truth control input for Cosmos Transfer, which turns those low-resolution control signals into high-quality video through its Multi-ControlNet architecture.

The Multi-ControlNet architecture is what preserves the structured information. Rather than letting the model freely hallucinate appearance, it conditions generation on the depth map, segmentation mask, and edge detection output from the simulator. The visual style changes. The spatial layout does not.
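The idea of several structured signals jointly constraining the output can be illustrated with a deliberately simplified toy. In the sketch below, each control signal is a flat list of numbers with a weight, and the fused guidance is their weighted average. This is an assumption made for illustration only: the article does not describe how Multi-ControlNet actually combines its control branches, and `ControlSignal` and `fuse_controls` are hypothetical names, not Cosmos APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ControlSignal:
    """Toy stand-in for one control modality (depth, segmentation, edges)."""
    name: str
    weight: float
    features: list[float]  # flattened toy feature map


def fuse_controls(signals: list[ControlSignal]) -> list[float]:
    """Weighted average of per-modality features (illustrative only)."""
    if not signals:
        raise ValueError("at least one control signal is required")
    length = len(signals[0].features)
    if any(len(s.features) != length for s in signals):
        raise ValueError("all control signals must share one spatial layout")
    total = sum(s.weight for s in signals)
    if total <= 0:
        raise ValueError("control weights must sum to a positive value")
    return [
        sum(s.weight * s.features[i] for s in signals) / total
        for i in range(length)
    ]


guidance = fuse_controls([
    ControlSignal("depth", 1.0, [0.0, 2.0]),
    ControlSignal("segmentation", 1.0, [2.0, 0.0]),
    ControlSignal("edge", 2.0, [1.0, 1.0]),
])
print(guidance)  # [1.0, 1.0]
```

The shape check is the part that mirrors the real constraint: every control map describes the same spatial layout, so the generated appearance has nowhere to move the geometry.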

From a single simulator scenario, Cosmos Transfer 2.5 generated 18 distinct augmentation variations, significantly expanding training data diversity without additional manual effort in the simulator. That 18x multiplier is the practical pitch. You run the sim once, get the ground truth, then generate a range of weather conditions, lighting changes, and surface textures from that single pass.
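One way to picture where 18 variants could come from is a small grid of conditions applied to the same ground-truth pass. The sketch below is hypothetical: the article names the categories (weather, lighting, surface texture) but not the specific values or how Nvidia arrived at 18, so the 3 × 3 × 2 split and the prompt wording are assumptions.

```python
from itertools import product

# Hypothetical augmentation axes; the specific values are not from the article.
WEATHER = ["clear", "rain", "fog"]
LIGHTING = ["midday", "dusk", "night"]
SURFACE = ["asphalt", "concrete"]


def build_prompts(base_scene: str) -> list[str]:
    """Return one text prompt per (weather, lighting, surface) combination."""
    return [
        f"{base_scene}, {weather} weather, {lighting} lighting, {surface} road surface"
        for weather, lighting, surface in product(WEATHER, LIGHTING, SURFACE)
    ]


prompts = build_prompts("highway on-ramp with a wrong-way vehicle")
print(len(prompts))  # 18
print(prompts[0])
```

Each prompt would be paired with the same depth, segmentation, and edge maps, so the labels from the single simulator run remain valid for all 18 outputs.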

Autonomous vehicle improvements

In Nvidia's evaluation, 3D lane and cuboid detection on generated multi-view videos, with real-world scenarios as the control input, improved by up to 60% over the previous model (Transfer 1-7B). LATR was used for lane detection and BEVFormer for cuboid detection.

Transfer 2.5 is 3.5x smaller than its predecessor yet faster and higher quality, optimized for deployment in both research and production pipelines. Going from 7B to 2B parameters while improving output quality and AV detection accuracy is the kind of efficiency gain that makes a model practical outside of a well-funded research lab.

The rest of the Cosmos platform

Predict and Transfer are not the only components in the 2.5 release. Two supporting pieces make them more useful in practice.

Cosmos Reason 1 is the 7B-parameter vision-language model used internally by Predict 2.5 as its text encoder. It enables robots and vision AI agents to reason like humans, using prior knowledge, physics understanding, and common sense to understand and act in the real world. Cosmos Reason 1 has topped the Physical Reasoning leaderboard on Hugging Face. You can also deploy it as a standalone model via NVIDIA NIM microservices.

Cosmos Dataset Search handles the upstream data problem. It is a vector-based workflow that enables physical AI developers to instantly search and retrieve targeted scenarios from massive training datasets. Using the Cosmos Embed NIM, it enables highly accurate semantic search and connects to NVIDIA Cosmos Curator to refine datasets and retrieve queried data, with the ability to search billions of clips in seconds. The pitch from Nvidia is that post-training cycles that previously took years can now take days, because you can find specific edge-case clips instead of re-running the whole simulation.
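The core of any vector-based search is ranking stored embeddings by similarity to a query embedding. The sketch below shows that step with cosine similarity over a tiny in-memory index. It is a toy under stated assumptions: the clip names and 3-dimensional vectors are made up, the exhaustive scan stands in for whatever index structure a production system uses, and nothing here calls the Cosmos Embed NIM or Cosmos Curator.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: list[float], index: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the k clip IDs whose embeddings are most similar to the query."""
    scored = sorted(
        ((cosine(query, vec), clip_id) for clip_id, vec in index.items()),
        reverse=True,
    )
    return [clip_id for _, clip_id in scored[:k]]


# Made-up embeddings; a real system would get these from an embedding model.
INDEX = {
    "wrong_way_driver_night": [0.9, 0.1, 0.0],
    "forklift_blind_corner": [0.1, 0.9, 0.1],
    "robot_arm_grasp": [0.0, 0.2, 0.9],
}

print(top_k([1.0, 0.0, 0.1], INDEX, k=1))  # ['wrong_way_driver_night']
```

Searching billions of clips in seconds requires an approximate nearest-neighbor index rather than this linear scan, but the ranking principle is the same.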

How the model family fits into Nvidia's broader physical AI stack

Cosmos 2.5 does not exist in isolation. Nvidia is building a full-stack platform where each layer depends on the others.

Nvidia physical AI platform layers

Application (robot or AV deployment)
        │
Robot Policy (Isaac GR00T N1.6 VLA model)
        │
┌───────┴────────────────────┐
│   Cosmos WFMs (2.5 family) │
│ Predict / Transfer / Reason│
└───────┬────────────────────┘
        │
Omniverse simulation environment
        │
NVIDIA hardware (H100 / B200 / Jetson edge)
        │
Data curation (Cosmos Curator + Dataset Search)

Nvidia also introduced Isaac Lab-Arena, an open-source simulation framework hosted on GitHub that serves as another component of the physical AI platform, enabling safe virtual testing of robotic capabilities. Isaac GR00T N1.6, the humanoid robot VLA model, uses Cosmos Reason as its reasoning layer. The world models generate the training environments. Omniverse provides the simulation substrate before and after training.

Omniverse creates realistic 3D simulations of real-world tasks by using different generative APIs, SDKs, and NVIDIA RTX rendering technology. Developers can input Omniverse simulations as instructional videos into Cosmos Transfer models to generate controllable, photorealistic synthetic data.

This is not just a set of open models. It is a bet on vertical integration: Nvidia wants to own the data generation pipeline, the simulation environment, the inference hardware, and the model layer simultaneously.

How to get started with Cosmos 2.5

Both models are available now on Hugging Face under the NVIDIA Open Model License, which permits research and commercial use with attribution.

The fastest path to running inference:

Cosmos Predict 2.5 (2B): nvidia/Cosmos-Predict2.5-2B on Hugging Face. The GitHub repository provides the inference scripts and post-training recipes.
Cosmos Transfer 2.5 (2B): nvidia/Cosmos-Transfer2.5-2B. The sim-to-real CARLA augmentation example in the Cosmos Cookbook is the clearest starting point for AV teams.
Cosmos Reason 1 (7B): Available as a standalone model or via NVIDIA NIM for API-based deployment without managing GPU infrastructure.
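For the two repositories named above, fetching the weights can be sketched with the `huggingface_hub` package and its `snapshot_download` function. This is a minimal sketch, not the official setup procedure: the `download_weights` helper and the `checkpoints` directory are hypothetical, the package must be installed separately, and the repositories may require accepting the license or authenticating with Hugging Face before the download succeeds.

```python
from pathlib import Path

# Repository IDs as given in the article.
MODEL_REPOS = {
    "predict": "nvidia/Cosmos-Predict2.5-2B",
    "transfer": "nvidia/Cosmos-Transfer2.5-2B",
}


def download_weights(model: str, root: str = "checkpoints") -> Path:
    """Download one model snapshot into root/<model> and return that path."""
    # Imported lazily so the mapping above is usable without the package installed.
    from huggingface_hub import snapshot_download

    target = Path(root) / model
    snapshot_download(repo_id=MODEL_REPOS[model], local_dir=str(target))
    return target


# Example (downloads several gigabytes, so it is left commented out):
# path = download_weights("predict")
```

Running inference still goes through the scripts in the Cosmos GitHub repositories; this only gets the checkpoint files onto disk.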

Hardware note: Running the 14B Predict model at 1280x720 is GPU-intensive. With sparse attention optimizations, Nvidia quotes 2-2.5x faster inference on H100 and B200 GPUs than on earlier hardware. The 2B models are usable on a single A100, but expect slower generation times than the published benchmarks.
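A rough way to see why the 14B model is demanding is to estimate the memory needed just to hold the weights. The sketch below assumes 2 bytes per parameter (16-bit weights), which is an assumption rather than a published figure. It also ignores activations, the video tokenizer, and the Cosmos Reason 1 text encoder, so real requirements will be higher.

```python
def weight_gb(params_billions: float, bytes_per_param: int = 2) -> float:
    """Approximate weight memory in GB (10^9 bytes), weights only."""
    return params_billions * 1e9 * bytes_per_param / 1e9


for name, size in [("Predict 2.5 2B", 2), ("Reason 1 7B", 7), ("Predict 2.5 14B", 14)]:
    print(f"{name}: ~{weight_gb(size):.0f} GB of weights")
# Predict 2.5 2B: ~4 GB of weights
# Reason 1 7B: ~14 GB of weights
# Predict 2.5 14B: ~28 GB of weights
```

Under those assumptions, the 14B model paired with its 7B text encoder already needs about 42 GB before any activations are allocated, which is where data-center GPUs become the practical floor.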

For structured workflows, the Cosmos Cookbook on GitHub contains step-by-step recipes for robotics (RoboCasa and Libero benchmarks) and autonomous vehicle augmentation. If you already run Isaac Sim or CARLA, integrating Cosmos into that setup is the logical next step.

What this means for developers

The headline number here is the 18x data multiplication from a single simulator scenario. For most robotics and AV teams, data is the constraint. Collecting hours of real-world edge-case footage is expensive, often dangerous, and sometimes physically impossible. Transfer 2.5 turns one carefully constructed simulator scenario into 18 photorealistic training variants without re-running the simulation.

The consolidation in Predict 2.5 also matters. Collapsing three separate models into one simplifies the pipeline for teams that were previously maintaining multiple checkpoints. Less infrastructure to manage means faster iteration.

There are two real limitations to watch. First, you still need quality simulator output as the starting point for Transfer 2.5. The model makes synthetic footage more realistic, but it cannot fix fundamentally wrong geometry or broken physics in the underlying sim. Second, the 14B model requires serious compute. Teams without access to H100-class hardware will be limited to the 2B models, which trade some quality for accessibility.

Frequently asked questions

What is a world foundation model?

A world foundation model is trained on large-scale video to predict how physical environments evolve over time. Unlike language models (which predict text) or image models (which predict pixels in a static frame), world models learn temporal causality: what happens next given what is visible now. Cosmos 2.5 uses that capability to generate synthetic training video for robots and autonomous vehicles.

Is Cosmos 2.5 free to use?

Yes, with conditions. Both Predict 2.5 and Transfer 2.5 are available on Hugging Face under the NVIDIA Open Model License, which allows research and commercial use. You run the models on your own hardware or cloud compute. There is no hosted inference fee from Nvidia for the open models, though Cosmos Reason 1 is also available via NVIDIA NIM, which is a paid managed API service.

What is the difference between Cosmos Predict 2.5 and Cosmos Transfer 2.5?

Predict 2.5 generates world scenarios from scratch: give it a text prompt, an image, or a seed video, and it generates new footage of that scenario. Transfer 2.5 transforms existing simulator output into photorealistic footage while preserving the ground-truth geometry and labels. In practice, most teams will use both: Predict to generate new scenarios, and Transfer to make those scenarios look realistic enough to train production models.

What hardware do you need to run Cosmos 2.5?

Nvidia quotes 2-2.5x faster inference on H100 and B200 GPUs. The 2B models can run on a single A100 or equivalent. The 14B Predict model at full 1280x720 resolution is significantly more compute-intensive. Nvidia has added Blackwell and ARM inference support, so edge deployment for lighter workloads is possible, but most production training pipelines will need data-center class GPUs.

How does Cosmos 2.5 differ from Cosmos 3?

Cosmos 3 is Nvidia's newer omni-model announced at COMPUTEX 2026. It extends generation beyond video to include text, images, sound, and actions in a single Mixture-of-Transformers architecture. Cosmos 2.5 kept perception and generation as separate model types and was limited to text, image, and video modalities. Cosmos 2.5 remains the more accessible and better-documented option for teams building robotics and AV pipelines today, while Cosmos 3 is still in early access.