Cosmos-Predict1 is a key branch of Cosmos World Foundation Models (WFMs) specialized for future state prediction; such models are often referred to as world models. The three main branches of Cosmos WFMs are cosmos-predict, cosmos-transfer, and cosmos-reason. We visualize the architecture of Cosmos-Predict1 in the following figure.
Cosmos-Predict1 includes the following:
- Diffusion-based world foundation models for Text2World and Video2World generation, where a user can generate visual simulations from text prompts and video prompts.
- Autoregressive-based world foundation models for Video2World generation, where a user can generate visual simulations from video prompts and optional text prompts.
- Image and video tokenizers for tokenizing videos into continuous tokens (latent vectors) and discrete tokens (integers) efficiently and effectively.
- Post-training scripts that help Physical AI builders post-train the pre-trained Cosmos-Predict1 models for their applications.
We provide a comprehensive set of examples that illustrate how to perform inference, post-training, and more with Cosmos-Predict1. Click a relevant example below to start your Cosmos journey.
Please refer to INSTALL.md for general instructions on environment setup.
- Inference with diffusion-based Text2World models [with multi-GPU support]
- Inference with diffusion-based Video2World models [with multi-GPU support]
- Inference with autoregressive-based base models [with multi-GPU support]
- Inference with autoregressive-based Video2World models [with multi-GPU support]
- Inference with tokenizer models
- Post-train diffusion-based Text2World models using custom datasets [with multi-node support]
- Post-train diffusion-based Video2World models using custom datasets [with multi-node support]
- Post-train diffusion-based Text2World models using custom multi-view datasets [with multi-node support]
- Post-train diffusion-based Video2World models using custom multi-view datasets [with multi-node support]
- Post-train autoregressive-based base models using custom datasets [with multi-node support]
- Post-train tokenizers using custom datasets [with multi-node support]
- Inference with post-trained multi-view diffusion-based Text2World models [with multi-GPU support]
- Inference with post-trained multi-view diffusion-based Video2World models [with multi-GPU support]
Cosmos-Predict1 includes the following models:
Diffusion models
- Cosmos-Predict1-7B-Text2World: Text to visual world generation
- Cosmos-Predict1-14B-Text2World: Text to visual world generation
- Cosmos-Predict1-7B-Video2World: Video + Text based future visual world generation
- Cosmos-Predict1-14B-Video2World: Video + Text based future visual world generation
Autoregressive models
- Cosmos-Predict1-4B: Future visual world generation
- Cosmos-Predict1-12B: Future visual world generation
- Cosmos-Predict1-5B-Video2World: Video + Text based future visual world generation
- Cosmos-Predict1-13B-Video2World: Video + Text based future visual world generation
Tokenizers
- Cosmos-Tokenize1-CV8×8×8-720p: Continuous Video Tokenizer with 8x8x8 spatio-temporal compression and a 121-frame context
- Cosmos-Tokenize1-DV8×16×16-720p: Discrete Video Tokenizer with 8x16x16 spatio-temporal compression and a 49-frame context
- Cosmos-Tokenize1-CI8×8-360p: Continuous Image Tokenizer with 8x8 spatial compression with low-resolution support
- Cosmos-Tokenize1-CI16x16-360p: Continuous Image Tokenizer with 16x16 spatial compression with low-resolution support
- Cosmos-Tokenize1-CV4×8×8-360p: Continuous Video Tokenizer with 4x8x8 spatio-temporal compression with low-resolution support
- Cosmos-Tokenize1-DI8×8-360p: Discrete Image Tokenizer with 8x8 spatial compression with low-resolution support
- Cosmos-Tokenize1-DI16x16-360p: Discrete Image Tokenizer with 16x16 spatial compression with low-resolution support
- Cosmos-Tokenize1-DV4×8×8-360p: Discrete Video Tokenizer with 4x8x8 spatio-temporal compression with low-resolution support
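The tokenizer names above encode the token type (C or D), the modality (I or V), and the compression factors. As a rough illustration of what those factors mean for latent size, the sketch below parses a name and estimates the latent grid for a given input. It is not part of the Cosmos codebase: `TokenizerSpec`, `parse_tokenizer_name`, and `latent_grid` are hypothetical helpers, the 1280x720 frame size for "720p" is an assumption, and the temporal formula `1 + (frames - 1) // factor` assumes a causal video tokenizer that encodes the first frame on its own. Check the tokenizer documentation before relying on these numbers.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenizerSpec:
    kind: str       # "C" (continuous tokens) or "D" (discrete tokens)
    modality: str   # "I" (image) or "V" (video)
    temporal: int   # temporal compression factor (1 for image tokenizers)
    spatial: int    # spatial compression factor along height and width


def parse_tokenizer_name(name: str) -> TokenizerSpec:
    """Parse a name such as 'Cosmos-Tokenize1-CV8x8x8-720p' (hypothetical helper)."""
    normalized = name.replace("×", "x")  # the list mixes "×" and "x"
    match = re.search(r"-([CD])([IV])(\d+(?:x\d+)+)-", normalized)
    if match is None:
        raise ValueError(f"unrecognized tokenizer name: {name!r}")
    kind, modality, factors_text = match.groups()
    factors = [int(part) for part in factors_text.split("x")]
    expected = 3 if modality == "V" else 2
    if len(factors) != expected or factors[-1] != factors[-2]:
        raise ValueError(f"unexpected compression factors in {name!r}")
    temporal = factors[0] if modality == "V" else 1
    return TokenizerSpec(kind, modality, temporal, factors[-1])


def latent_grid(spec: TokenizerSpec, frames: int, height: int, width: int):
    """Estimate the (frames, height, width) of the latent grid.

    Assumption: causal temporal compression, so the first frame is kept
    and the remaining frames are compressed by the temporal factor.
    Spatial sizes use floor division.
    """
    if frames < 1:
        raise ValueError("frames must be at least 1")
    latent_frames = 1 + (frames - 1) // spec.temporal
    return latent_frames, height // spec.spatial, width // spec.spatial


if __name__ == "__main__":
    cv = parse_tokenizer_name("Cosmos-Tokenize1-CV8×8×8-720p")
    # 121 frames of 1280x720 video (assumed frame size for "720p")
    print(cv, latent_grid(cv, frames=121, height=720, width=1280))

    dv = parse_tokenizer_name("Cosmos-Tokenize1-DV8×16×16-720p")
    print(dv, latent_grid(dv, frames=49, height=720, width=1280))
```

Under these assumptions, the 121-frame context of the CV8×8×8 tokenizer maps to 16 latent frames, and the 49-frame context of the DV8×16×16 tokenizer maps to 7.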
This project will download and install additional third-party open source software projects. Review the license terms of these open source projects before use.
This model includes safety and content moderation features powered by Llama Guard 3. Llama Guard 3 is used solely as a content input filter and is subject to its own license.
NVIDIA Cosmos source code is released under the Apache 2 License.
NVIDIA Cosmos models are released under the NVIDIA Open Model License. For a custom license, please contact [email protected].