forked from THUDM/CogVideo

Commit 34dd614 (parent: 30ebd13)
Showing 131 changed files with 22,766 additions and 3,851 deletions.
```diff
@@ -5,4 +5,5 @@ runs/
 checkpoints/
 master_ip
 logs/
-*.DS_Store
+*.DS_Store
+.idea
```
# CogVideoX
[中文阅读](./README_zh.md)
<div align="center">
<img src=resources/logo.svg width="50%"/>
<p align="center">
🤗 Experience on <a href="https://huggingface.co/spaces/THUDM/CogVideoX" target="_blank">CogVideoX Huggingface Space</a>
</p>
</div>
<p align="center">
👋 Join our <a href="resources/WECHAT.md" target="_blank">WeChat</a> and <a href="https://discord.gg/Ewaabk6s" target="_blank">Discord</a>
</p>
<p align="center">
📍 Visit <a href="https://chatglm.cn/video">清影</a> and the <a href="https://open.bigmodel.cn/?utm_campaign=open&_channel_track_key=OWTVNma9">API Platform</a> to experience larger-scale commercial video generation models.
</p>
## Update and News
- 🔥 **News**: ``2024/8/6``: We have also open-sourced the **3D Causal VAE** used in **CogVideoX-2B**, which can reconstruct videos almost losslessly.
- 🔥 **News**: ``2024/8/6``: We have open-sourced **CogVideoX-2B**, the first model in the CogVideoX series of video generation models.

**More powerful models with larger parameter sizes are on the way~ Stay tuned!**
## CogVideoX-2B Gallery
<div align="center">
    <video src="https://github.com/user-attachments/assets/ea3af39a-3160-4999-90ec-2f7863c5b0e9" width="80%" controls autoplay></video>
    <p>A detailed wooden toy ship with intricately carved masts and sails is seen gliding smoothly over a plush, blue carpet that mimics the waves of the sea. The ship's hull is painted a rich brown, with tiny windows. The carpet, soft and textured, provides a perfect backdrop, resembling an oceanic expanse. Surrounding the ship are various other toys and children's items, hinting at a playful environment. The scene captures the innocence and imagination of childhood, with the toy ship's journey symbolizing endless adventures in a whimsical, indoor setting.</p>
</div>
<div align="center">
    <video src="https://github.com/user-attachments/assets/9de41efd-d4d1-4095-aeda-246dd834e91d" width="80%" controls autoplay></video>
    <p>The camera follows behind a white vintage SUV with a black roof rack as it speeds up a steep dirt road surrounded by pine trees on a steep mountain slope, dust kicks up from its tires, the sunlight shines on the SUV as it speeds along the dirt road, casting a warm glow over the scene. The dirt road curves gently into the distance, with no other cars or vehicles in sight. The trees on either side of the road are redwoods, with patches of greenery scattered throughout. The car is seen from the rear following the curve with ease, making it seem as if it is on a rugged drive through the rugged terrain. The dirt road itself is surrounded by steep hills and mountains, with a clear blue sky above with wispy clouds.</p>
</div>
<div align="center">
    <video src="https://github.com/user-attachments/assets/941d6661-6a8d-4a1b-b912-59606f0b2841" width="80%" controls autoplay></video>
    <p>A street artist, clad in a worn-out denim jacket and a colorful bandana, stands before a vast concrete wall in the heart of the city, holding a can of spray paint, spray-painting a colorful bird on a mottled wall.</p>
</div>
<div align="center">
    <video src="https://github.com/user-attachments/assets/938529c4-91ae-4f60-b96b-3c3947fa63cb" width="80%" controls autoplay></video>
    <p>In the haunting backdrop of a war-torn city, where ruins and crumbled walls tell a story of devastation, a poignant close-up frames a young girl. Her face is smudged with ash, a silent testament to the chaos around her. Her eyes glistening with a mix of sorrow and resilience, capturing the raw emotion of a world that has lost its innocence to the ravages of conflict.</p>
</div>
## Model Introduction
CogVideoX is an open-source video generation model from the same family as [清影](https://chatglm.cn/video).
The table below lists the video generation models we currently provide, along with related basic information:
| Model Name                                 | CogVideoX-2B                                                 |
|--------------------------------------------|--------------------------------------------------------------|
| Prompt Language                            | English                                                      |
| GPU Memory Required for Inference (FP16)   | 21.6 GB                                                      |
| GPU Memory Required for Fine-tuning (bs=1) | 46.2 GB                                                      |
| Maximum Prompt Length                      | 226 tokens                                                   |
| Video Length                               | 6 seconds                                                    |
| Frames Per Second                          | 8 frames                                                     |
| Resolution                                 | 720 × 480                                                    |
| Quantized Inference                        | Not supported                                                |
| Multi-GPU Inference                        | Not supported                                                |
| Download Link                              | 🤗 [CogVideoX-2B](https://huggingface.co/THUDM/CogVideoX-2B) | 
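From the specs above, one clip is 6 seconds at 8 frames per second, i.e. 48 frames at 720 × 480. As a quick back-of-the-envelope sketch (assuming uncompressed 8-bit RGB frames, purely for illustration), the raw size of a single clip works out as follows:

```python
# Rough size of one uncompressed CogVideoX-2B clip, derived from the model table.
seconds, fps = 6, 8
width, height, channels = 720, 480, 3  # 720x480 RGB frames (assumed uint8)

frames = seconds * fps                             # 6 s * 8 fps = 48 frames
raw_bytes = frames * width * height * channels     # bytes before any video encoding

print(frames)             # 48
print(raw_bytes / 2**20)  # 47.4609375 (~47.5 MiB of raw pixels per clip)
```

The actual saved files are far smaller, since the demos write encoded video (e.g. MP4), not raw frames.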
## Project Structure
This open-source repository will guide developers to quickly get started with the basic usage and fine-tuning examples of the **CogVideoX** open-source model.
### Inference
+ [cli_demo](inference/cli_demo.py): A more detailed walkthrough of the inference code, explaining the significance of common parameters.
+ [cli_vae_demo](inference/cli_vae_demo.py): Running the VAE inference code on its own currently requires 71 GB of memory; this will be optimized in the future.
+ [convert_demo](inference/convert_demo.py): How to convert user input into a format suitable for CogVideoX.
+ [web_demo](inference/web_demo.py): A simple Streamlit web application demonstrating how to generate videos with the CogVideoX-2B model.
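One constraint any prompt-preparation step must respect is the 226-token prompt limit listed in the model table. The helper below is a hypothetical sketch of enforcing such a cap, not the repository's API: `truncate_prompt` is an invented name, and whitespace splitting stands in for the model's real tokenizer, which counts tokens differently.

```python
MAX_PROMPT_TOKENS = 226  # CogVideoX-2B prompt cap (see the model table)

def truncate_prompt(prompt: str, limit: int = MAX_PROMPT_TOKENS) -> str:
    """Illustrative only: real token counts come from the model's text
    tokenizer, not from whitespace splitting."""
    tokens = prompt.split()
    return " ".join(tokens[:limit])

long_prompt = "a " * 300  # 300 whitespace "tokens"
print(len(truncate_prompt(long_prompt).split()))  # 226
```

In practice, prompts over the limit are better rewritten (shortened or summarized) than blindly truncated, since the tail of a scene description often carries important detail.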
<div style="text-align: center;">
    <img src="resources/web_demo.png" style="width: 100%; height: auto;" />
</div>
### sat
+ [sat_demo](sat/configs/README_zh.md): Contains the inference and fine-tuning code for SAT weights. It is the recommended starting point for building on the CogVideoX model structure, and researchers can use this code for rapid prototyping and development.
### Tools
This folder contains tools for model conversion, caption generation, etc.
+ [convert_weight_sat2hf](tools/convert_weight_sat2hf.py): Convert SAT model weights to Huggingface model weights.
+ [caption_demo](tools/caption_demo.py): A captioning tool: a model that understands videos and describes them in text.
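At its core, a weight-conversion step like `convert_weight_sat2hf` maps checkpoint tensors from one framework's key naming to another's. The toy sketch below illustrates that idea with plain dictionaries; the key names and mapping are invented for demonstration and are not the repository's actual schema.

```python
# Toy illustration of SAT -> Huggingface weight conversion: rename state-dict
# keys according to a mapping. Keys and values here are made up; the real
# script handles tensor reshaping and the model's actual key layout.
def rename_keys(state_dict, key_map):
    """Return a copy of state_dict with keys renamed per key_map;
    unmapped keys are kept unchanged."""
    return {key_map.get(k, k): v for k, v in state_dict.items()}

sat_weights = {"transformer.layers.0.attn.qkv": [0.1], "final_norm.weight": [1.0]}
key_map = {"transformer.layers.0.attn.qkv": "blocks.0.attention.to_qkv.weight"}
hf_weights = rename_keys(sat_weights, key_map)

print(sorted(hf_weights))  # ['blocks.0.attention.to_qkv.weight', 'final_norm.weight']
```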
## Project Plan
- [x] Open source CogVideoX model
- [x] Open source 3D Causal VAE used in CogVideoX
- [x] CogVideoX model inference example (CLI / Web Demo)
- [x] CogVideoX online experience demo (Huggingface Space)
- [x] CogVideoX open source model API interface example (Huggingface)
- [x] CogVideoX model fine-tuning example (SAT)
- [ ] CogVideoX model fine-tuning example (Huggingface / SAT)
- [ ] Open source CogVideoX-Pro (adapted for the CogVideoX-2B suite)
- [ ] Release CogVideoX technical report
We welcome your contributions! You can click [here](resources/contribute.md) for more information.
## Model License
The code in this repository is released under the [Apache 2.0 License](LICENSE).
The model weights and implementation code are released under the [CogVideoX LICENSE](MODEL_LICENSE).
## Citation
🌟 If you find our work helpful, please leave us a star. 🌟
The paper is still being written and will be released soon. Stay tuned!