OpenMOSS · Sunwood-ai-labs · Mar 19, 2024 · Mar 26, 2024
diff --git a/.gitignore b/.gitignore
diff --git a/Docker/Dockerfile b/Docker/Dockerfile
@@ -0,0 +1,39 @@
+FROM nvidia/cuda:12.0.1-cudnn8-runtime-ubuntu22.04
+ENV DEBIAN_FRONTEND=noninteractive
+
+RUN apt-get update \
+    && apt-get upgrade -y \
+    && apt-get install -y --no-install-recommends \
+        git \
+        gcc \
+        curl \
+        wget \
+        sudo \
+        pciutils \
+        python3-all-dev \
+        python-is-python3 \
+        python3-pip \
+        ffmpeg \
+        libsdl2-dev \
+        pulseaudio \
+        alsa-utils \
+        portaudio19-dev \
+    && pip install pip -U
+
+WORKDIR /app
+
+RUN git clone https://github.com/Sunwood-ai-labs/AnyGPT-JP.git .
+
+RUN pip install -r requirements.txt
+
+RUN mkdir -p models/anygpt \
+    && mkdir -p models/seed-tokenizer-2 \
+    && mkdir -p models/speechtokenizer \
+    && mkdir -p models/soundstorm
+
+RUN pip install bigmodelvis \
+    && pip install peft \
+    && pip install --upgrade huggingface_hub
+
+# COPY scripts/download_models.py /app/scripts/download_models.py
+# RUN python scripts/download_models.py
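The Dockerfile refers to `scripts/download_models.py`, but that script is not part of this diff. The following is a minimal sketch of what such a script could look like, not the PR's actual implementation. The repository IDs come from the README's "Model Weights" list; the local target directories are assumptions chosen to match the paths in the inference command, and mapping `fnlp/AnyGPT-speech-modules` to `models` assumes (unverified) that the repository contains `speechtokenizer/` and `soundstorm/` subfolders. The names `MODEL_REPOS`, `plan_downloads`, and `download_all` are hypothetical.

```python
from pathlib import Path

# Repo IDs are taken from the README; target directories are assumptions
# that mirror the paths passed to cli_infer_base_model.py.
MODEL_REPOS = {
    "fnlp/AnyGPT-base": "models/anygpt/base",
    "AILab-CVC/seed-tokenizer-2": "models/seed-tokenizer-2",
    # Assumes the repo holds speechtokenizer/ and soundstorm/ subfolders (unverified).
    "fnlp/AnyGPT-speech-modules": "models",
}


def plan_downloads(root="."):
    """Return (repo_id, target directory) pairs without touching the network."""
    base = Path(root).resolve()
    return [(repo_id, base / target) for repo_id, target in MODEL_REPOS.items()]


def download_all(root="."):
    """Download a snapshot of every repository into its target directory."""
    # Imported lazily so plan_downloads() works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    for repo_id, target in plan_downloads(root):
        target.mkdir(parents=True, exist_ok=True)
        snapshot_download(repo_id=repo_id, local_dir=str(target))
```

Calling `download_all()` performs the actual download; `plan_downloads()` only reports where each snapshot would go, which is useful for checking the layout first. Files may still need to be moved if the repositories are organized differently from what is assumed here.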
diff --git a/Docker/README.JP.md b/Docker/README.JP.md
@@ -0,0 +1,43 @@
+# Running AnyGPT with Docker
+
+This README explains how to run AnyGPT using Docker.
+
+## Prerequisites
+
+- Docker is installed
+- NVIDIA Container Toolkit is installed if running in a GPU environment
+
+## Steps
+
+
+1. Run the following command to build the Docker image:
+   ```bash
+   docker-compose up --build
+   ```
+
+2. Download the models:
+   ```bash
+   docker-compose run anygpt python /app/scripts/download_models.py
+   ```
+
+3. Run inference:
+   ```bash
+   docker-compose run anygpt python anygpt/src/infer/cli_infer_base_model.py \
+     --model-name-or-path models/anygpt/base \
+     --image-tokenizer-path models/seed-tokenizer-2/seed_quantizer.pt \
+     --speech-tokenizer-path models/speechtokenizer/ckpt.dev \
+     --speech-tokenizer-config models/speechtokenizer/config.json \
+     --soundstorm-path models/soundstorm/speechtokenizer_soundstorm_mls.pt \
+     --output-dir "infer_output/base"
+   ```
+
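+4. The inference results are written to the `infer_output/base` directory.
+
+## Troubleshooting
+
+- If the model download fails, check the `download_models.py` script and update the URLs if necessary.
+- If inference fails, check the command arguments and make sure the model paths are correct.
+
+## Notes
+
+- Downloading the models and running inference require a large amount of memory and disk space. Make sure sufficient resources are available.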
diff --git a/Docker/README.md b/Docker/README.md
@@ -0,0 +1,42 @@
+# Running AnyGPT with Docker
+
+This README explains how to run AnyGPT using Docker.
+
+## Prerequisites
+
+- Docker is installed
+- NVIDIA Container Toolkit is installed if running in a GPU environment
+
+## Steps
+
+1. Build the Docker image by running the following command:
+   ```bash
+   docker-compose up --build
+   ```
+
+2. Download the models:
+   ```bash
+   docker-compose run anygpt python /app/scripts/download_models.py
+   ```
+
+3. Run the inference:
+   ```bash
+   docker-compose run anygpt python anygpt/src/infer/cli_infer_base_model.py \
+     --model-name-or-path models/anygpt/base \
+     --image-tokenizer-path models/seed-tokenizer-2/seed_quantizer.pt \
+     --speech-tokenizer-path models/speechtokenizer/ckpt.dev \
+     --speech-tokenizer-config models/speechtokenizer/config.json \
+     --soundstorm-path models/soundstorm/speechtokenizer_soundstorm_mls.pt \
+     --output-dir "infer_output/base"
+   ```
+
+4. The inference results are written to the `infer_output/base` directory in the repository root, which `docker-compose.yml` mounts at `/app`.
+
+## Troubleshooting
+
+- If the model download fails, check the `download_models.py` script and update the URLs if necessary.
+- If the inference execution fails, check the command arguments and ensure that the model paths are correct.
+
+## Notes
+
+- Downloading the models and running the inference requires a large amount of memory and disk space. Ensure that sufficient resources are available.
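The troubleshooting advice above ("ensure that the model paths are correct") can be automated with a small pre-flight check. This is a sketch under the assumption that the script runs from the repository root (or `/app` inside the container); the paths are copied from the inference command in the README, while `REQUIRED_PATHS` and `find_missing` are hypothetical names.

```python
from pathlib import Path

# Paths copied from the inference command in the README, relative to the repo root.
REQUIRED_PATHS = [
    "models/anygpt/base",
    "models/seed-tokenizer-2/seed_quantizer.pt",
    "models/speechtokenizer/ckpt.dev",
    "models/speechtokenizer/config.json",
    "models/soundstorm/speechtokenizer_soundstorm_mls.pt",
]


def find_missing(root=".", required=REQUIRED_PATHS):
    """Return the required paths that do not exist under root."""
    base = Path(root)
    return [p for p in required if not (base / p).exists()]
```

An empty list from `find_missing()` means every path passed to `cli_infer_base_model.py` exists; otherwise the returned entries are the ones to re-download or fix.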
diff --git a/README.md b/README.md
@@ -2,9 +2,15 @@
 <a href='https://junzhan2000.github.io/AnyGPT.github.io/'><img src='https://img.shields.io/badge/Project-Page-Green'></a>  <a href='https://arxiv.org/pdf/2402.12226.pdf'><img src='https://img.shields.io/badge/Paper-Arxiv-red'></a> [![](https://img.shields.io/badge/Datasets-AnyInstruct-yellow)](https://huggingface.co/datasets/fnlp/AnyInstruct)
 
 <p align="center">
-    <img src="static/images/logo.png" width="16%"> <br>
+    <img src="https://raw.githubusercontent.com/OpenMOSS/AnyGPT/main/static/images/logo.png" width="16%"> <br>
 </p>
 
+<div align="center">
+
+ | [日本語](docs/README_JP.md) | [English](README.md) |
+
+</div>
+
 ## Introduction
 We introduce AnyGPT, an any-to-any multimodal language model that utilizes discrete representations for the unified processing of various modalities, including speech, text, images, and music. The [base model](https://huggingface.co/fnlp/AnyGPT-base) aligns the four modalities, allowing for intermodal conversions between different modalities and text. Furthermore, we constructed the [AnyInstruct](https://huggingface.co/datasets/fnlp/AnyInstruct) dataset based on various generative models, which contains instructions for arbitrary modal interconversion. Trained on this dataset, our [chat model](https://huggingface.co/fnlp/AnyGPT-chat) can engage in free multimodal conversations, where multimodal data can be inserted at will.
 
@@ -36,7 +42,7 @@ pip install -r requirements.txt
 ### Model Weights
 * Check the AnyGPT-base weights in [fnlp/AnyGPT-base](https://huggingface.co/fnlp/AnyGPT-base)
 * Check the AnyGPT-chat weights in [fnlp/AnyGPT-chat](https://huggingface.co/fnlp/AnyGPT-chat)
-* Check the SpeechTokenizer and Soundstorm weights in [fnlp/AnyGPT-speech-modules](https://huggingface.co/fnlp/AnyGPT-base)
+* Check the SpeechTokenizer and Soundstorm weights in [fnlp/AnyGPT-speech-modules](https://huggingface.co/fnlp/AnyGPT-speech-modules)
 * Check the SEED tokenizer weights in [AILab-CVC/seed-tokenizer-2](https://huggingface.co/AILab-CVC/seed-tokenizer-2)
 
 
@@ -45,6 +51,10 @@ The SpeechTokenizer is used for tokenizing and reconstructing speech, Soundstorm
 The model weights of unCLIP SD-UNet which are used to reconstruct the image, and Encodec-32k which are used to tokenize and reconstruct music will be downloaded automatically.
 
 ### Base model CLI Inference
+
+[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/13_gZPIRG6ShkAbI76-hC_etvfGhry0DZ?usp=sharing)
+
+
 ```bash
 python anygpt/src/infer/cli_infer_base_model.py \
 --model-name-or-path "path/to/AnyGPT-7B-base" \

diff --git a/anygpt/src/__pycache__/__init__.cpython-39.pyc b/anygpt/src/__pycache__/__init__.cpython-39.pyc
diff --git a/anygpt/src/infer/__pycache__/__init__.cpython-39.pyc b/anygpt/src/infer/__pycache__/__init__.cpython-39.pyc
diff --git a/anygpt/src/infer/__pycache__/pre_post_process.cpython-39.pyc b/anygpt/src/infer/__pycache__/pre_post_process.cpython-39.pyc
diff --git a/anygpt/src/infer/__pycache__/voice_clone.cpython-39.pyc b/anygpt/src/infer/__pycache__/voice_clone.cpython-39.pyc
diff --git a/anygpt/src/infer/pre_post_process.py b/anygpt/src/infer/pre_post_process.py
@@ -4,7 +4,7 @@
 sys.path.append("/mnt/petrelfs/zhanjun.p/mllm")
 sys.path.append("/mnt/petrelfs/zhanjun.p/src")
 from transformers import GenerationConfig
-from mmgpt.src.m_utils.prompter_mmgpt import Prompter
+from anygpt.src.m_utils.prompter import Prompter
 from tqdm import tqdm
 
 from m_utils.conversation import get_conv_template
@@ -237,4 +237,4 @@ def text_normalization(original_text):
     tag1 = '[MMGPT]'
     tag2 = '[eom]'
     print(extract_content_between_final_tags(example_text, tag1, tag2))
-    print(extract_text_between_tags(example_text, tag1, tag2))
+    print(extract_text_between_tags(example_text, tag1, tag2))
diff --git a/anygpt/src/m_utils/__pycache__/__init__.cpython-311.pyc b/anygpt/src/m_utils/__pycache__/__init__.cpython-311.pyc
diff --git a/anygpt/src/m_utils/__pycache__/__init__.cpython-39.pyc b/anygpt/src/m_utils/__pycache__/__init__.cpython-39.pyc
diff --git a/anygpt/src/m_utils/__pycache__/anything2token.cpython-39.pyc b/anygpt/src/m_utils/__pycache__/anything2token.cpython-39.pyc
diff --git a/anygpt/src/m_utils/__pycache__/conversation.cpython-39.pyc b/anygpt/src/m_utils/__pycache__/conversation.cpython-39.pyc
diff --git a/anygpt/src/m_utils/__pycache__/instructions.cpython-39.pyc b/anygpt/src/m_utils/__pycache__/instructions.cpython-39.pyc
diff --git a/anygpt/src/m_utils/__pycache__/other2text_instructions.cpython-39.pyc b/anygpt/src/m_utils/__pycache__/other2text_instructions.cpython-39.pyc
diff --git a/anygpt/src/m_utils/__pycache__/output.cpython-39.pyc b/anygpt/src/m_utils/__pycache__/output.cpython-39.pyc
diff --git a/anygpt/src/m_utils/__pycache__/prompter.cpython-311.pyc b/anygpt/src/m_utils/__pycache__/prompter.cpython-311.pyc
diff --git a/anygpt/src/m_utils/__pycache__/prompter.cpython-39.pyc b/anygpt/src/m_utils/__pycache__/prompter.cpython-39.pyc
diff --git a/anygpt/src/m_utils/__pycache__/prompter_mmgpt.cpython-39.pyc b/anygpt/src/m_utils/__pycache__/prompter_mmgpt.cpython-39.pyc
diff --git a/anygpt/src/m_utils/__pycache__/prompter_old.cpython-39.pyc b/anygpt/src/m_utils/__pycache__/prompter_old.cpython-39.pyc
diff --git a/anygpt/src/m_utils/__pycache__/read_modality.cpython-39.pyc b/anygpt/src/m_utils/__pycache__/read_modality.cpython-39.pyc
diff --git a/anygpt/src/m_utils/__pycache__/templates.cpython-311.pyc b/anygpt/src/m_utils/__pycache__/templates.cpython-311.pyc
diff --git a/anygpt/src/m_utils/__pycache__/templates.cpython-39.pyc b/anygpt/src/m_utils/__pycache__/templates.cpython-39.pyc
diff --git a/anygpt/src/m_utils/__pycache__/templates_old.cpython-39.pyc b/anygpt/src/m_utils/__pycache__/templates_old.cpython-39.pyc
diff --git a/anygpt/src/m_utils/__pycache__/text2other_instructions.cpython-39.pyc b/anygpt/src/m_utils/__pycache__/text2other_instructions.cpython-39.pyc
diff --git a/anygpt/src/m_utils/__pycache__/transforms.cpython-39.pyc b/anygpt/src/m_utils/__pycache__/transforms.cpython-39.pyc
diff --git a/docker-compose.yml b/docker-compose.yml
@@ -0,0 +1,21 @@
+version: '3'
+services:
+  anygpt:
+    image: anygpt
+    build:
+      context: ./Docker
+      dockerfile: Dockerfile
+    volumes:
+      - ./:/app
+      - ./.cache:/root/.cache
+
+    env_file:
+      - .env
+    tty: true
+    deploy:
+      resources:
+        reservations:
+          devices:
+            - driver: nvidia
+              count: 1
+              capabilities: [ gpu ]
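The `deploy.resources.reservations.devices` entry above requests one NVIDIA GPU for the service. One way to confirm that the GPU is actually visible inside the container is to call `nvidia-smi -L` from Python. This is a sketch, assuming `nvidia-smi` is on the `PATH` of the CUDA runtime image when the NVIDIA Container Toolkit is working; `list_gpus` is a hypothetical helper.

```python
import shutil
import subprocess


def list_gpus(executable="nvidia-smi"):
    """Return one line per visible GPU, or an empty list if none can be queried."""
    path = shutil.which(executable)
    if path is None:
        # The tool is not installed or not on PATH, so no GPU can be reported.
        return []
    result = subprocess.run([path, "-L"], capture_output=True, text=True, check=False)
    if result.returncode != 0:
        return []
    return [line for line in result.stdout.splitlines() if line.strip()]
```

Saved as, say, `check_gpu.py` (a hypothetical filename) with a final `print(list_gpus())`, it could be run with `docker-compose run anygpt python check_gpu.py`. An empty list suggests the GPU reservation or the container toolkit setup needs attention.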
diff --git a/docs/README_JP.md b/docs/README_JP.md
@@ -0,0 +1,134 @@
+# AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling
+
+<a href='https://junzhan2000.github.io/AnyGPT.github.io/'><img src='https://img.shields.io/badge/Project-Page-Green'></a>  <a href='https://arxiv.org/pdf/2402.12226.pdf'><img src='https://img.shields.io/badge/Paper-Arxiv-red'></a> [![](https://img.shields.io/badge/Datasets-AnyInstruct-yellow)](https://huggingface.co/datasets/fnlp/AnyInstruct)
+
+<p align="center">
+    <img src="https://raw.githubusercontent.com/OpenMOSS/AnyGPT/main/static/images/logo.png" width="16%"> <br>
+</p>
+
+<div align="center">
+
+ | [日本語](README_JP.md) | [English](../README.md) |
+
+</div>
+
+
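+## Introduction
+AnyGPT is an any-to-any multimodal language model that uses discrete representations to process speech, text, images, and music in a unified way. The [base model](https://huggingface.co/fnlp/AnyGPT-base) aligns the four modalities, enabling conversion between text and each of the other modalities. In addition, we built the [AnyInstruct](https://huggingface.co/datasets/fnlp/AnyInstruct) dataset from various generative models; it contains instructions for conversion between arbitrary modalities. The [chat model](https://huggingface.co/fnlp/AnyGPT-chat) trained on this dataset can hold free-form multimodal conversations in which multimodal data can be inserted at will.
+
+AnyGPT proposes a generative training scheme that converts all modal data into a unified discrete representation and trains a large language model (LLM) on it with the next-token prediction task. From the perspective that "compression is intelligence," if the tokenizer quality is high enough and the LLM's perplexity (PPL) is low enough, the vast amount of multimodal data on the internet can be compressed into a single model, and capabilities absent from purely text-based LLMs are expected to emerge.
+Demos are available on the [project page](https://junzhan2000.github.io/AnyGPT.github.io).
+
+## Demo Example
+[![Demo video](http://img.youtube.com/vi/oW3E3pIsaRg/0.jpg)](https://www.youtube.com/watch?v=oW3E3pIsaRg)
+
+## Open-Source Checklist
+- [x] Base model
+- [ ] Chat model
+- [x] Inference code
+- [x] Instruction dataset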
+
+## Inference
+
+### Installation
+
+```bash
+git clone https://github.com/OpenMOSS/AnyGPT.git
+cd AnyGPT
+conda create --name AnyGPT python=3.9
+conda activate AnyGPT
+pip install -r requirements.txt
+```
+
+### Model Weights
+* For the AnyGPT-base weights, see [fnlp/AnyGPT-base](https://huggingface.co/fnlp/AnyGPT-base).
+* For the AnyGPT-chat weights, see [fnlp/AnyGPT-chat](https://huggingface.co/fnlp/AnyGPT-chat).
+* For the SpeechTokenizer and Soundstorm weights, see [fnlp/AnyGPT-speech-modules](https://huggingface.co/fnlp/AnyGPT-speech-modules).
+* For the SEED tokenizer weights, see [AILab-CVC/seed-tokenizer-2](https://huggingface.co/AILab-CVC/seed-tokenizer-2).
+
+The SpeechTokenizer is used to tokenize and reconstruct speech, Soundstorm is responsible for completing paralinguistic information, and the SEED-tokenizer is used to tokenize images.
+
+The weights of unCLIP SD-UNet, which is used to reconstruct images, and Encodec-32k, which is used to tokenize and reconstruct music, are downloaded automatically.
+
+### Base Model CLI Inference
+
+[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/13_gZPIRG6ShkAbI76-hC_etvfGhry0DZ?usp=sharing)
+
+```bash
+python anygpt/src/infer/cli_infer_base_model.py \
+--model-name-or-path "path/to/AnyGPT-7B-base" \
+--image-tokenizer-path models/seed-tokenizer-2/seed_quantizer.pt \
+--speech-tokenizer-path "path/to/model" \
+--speech-tokenizer-config "path/to/config" \
+--soundstorm-path "path/to/model" \
+--output-dir "infer_output/base"
+```
+
+Example:
+```bash 
+python anygpt/src/infer/cli_infer_base_model.py \
+--model-name-or-path models/anygpt/base \
+--image-tokenizer-path models/seed-tokenizer-2/seed_quantizer.pt \
+--speech-tokenizer-path models/speechtokenizer/ckpt.dev \
+--speech-tokenizer-config models/speechtokenizer/config.json \
+--soundstorm-path models/soundstorm/speechtokenizer_soundstorm_mls.pt \
+--output-dir "infer_output/base"
+```
+
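+#### Interaction
+The base model can perform a variety of tasks, including text-to-image, image captioning, automatic speech recognition (ASR), zero-shot text-to-speech (TTS), text-to-music, and music captioning.
+
+Inference follows a specific instruction format.
+
+* Text-to-image
+  * ```text|image|{caption}```
+  * Example:
+  ```text|image|A bustling medieval market scene with vendors selling exotic goods under colorful tents```
+* Image captioning
+  * ```image|text|{path to image file}```
+  * Example:
+  ```image|text|static/infer/image/cat.jpg```
+* TTS (random voice)
+  * ```text|speech|{speech content}```
+  * Example:
+  ```text|speech|I could be bounded in a nutshell and count myself a king of infinite space.```
+* Zero-shot TTS
+  * ```text|speech|{speech content}|{voice prompt}```
+  * Example:
+  ```text|speech|I could be bounded in a nutshell and count myself a king of infinite space.|static/infer/speech/voice_prompt1.wav/voice_prompt3.wav```
+* ASR
+  * ```speech|text|{path to speech file}```
+  * Example: ```speech|text|AnyGPT/static/infer/speech/voice_prompt2.wav```
+* Text-to-music
+  * ```text|music|{caption}```
+  * Example:
+  ```text|music|features an indie rock sound with distinct elements that evoke a dreamy, soothing atmosphere```
+* Music captioning
+  * ```music|text|{path to music file}```
+  * Example: ```music|text|static/infer/music/features an indie rock sound with distinct element.wav```
+
+**Notes**
+
+Different tasks use different language-model decoding strategies. The decoding configuration files for image, speech, and music generation are ```config/image_generate_config.json```, ```config/speech_generate_config.json```, and ```config/music_generate_config.json```, respectively. The decoding configuration file for converting other modalities to text is ```config/text_generate_config.json```. You can change the decoding strategy by modifying or adding parameters directly.
+
+Because of limited data and training resources, the model's generation can still be unstable. Try generating several times or using a different decoding strategy.
+
+Speech and music responses are saved as ```.wav``` files, and image responses are saved as ```jpg``` files. Each filename is the prompt concatenated with the time. The paths to these files are shown in the response.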
+
+## Acknowledgements
+- [SpeechGPT](https://github.com/0nutation/SpeechGPT/tree/main/speechgpt), [Vicuna](https://github.com/lm-sys/FastChat): the codebases we built upon.
+- We thank [SpeechTokenizer](https://github.com/ZhangXInFD/SpeechTokenizer), [soundstorm-speechtokenizer](https://github.com/ZhangXInFD/soundstorm-speechtokenizer), and [SEED-tokenizer](https://github.com/AILab-CVC/SEED) for their excellent work.
+
+## License
+`AnyGPT` is released under the original [license](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) of [LLaMA2](https://huggingface.co/meta-llama/Llama-2-13b-chat-hf).
+
+## Citation
+If you find AnyGPT and AnyInstruct useful in your research or applications, please cite:
+```
+@article{zhan2024anygpt,
+  title={AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling},
+  author={Zhan, Jun and Dai, Junqi and Ye, Jiasheng and Zhou, Yunhua and Zhang, Dong and Liu, Zhigeng and Zhang, Xin and Yuan, Ruibin and Zhang, Ge and Li, Linyang and others},  
+  journal={arXiv preprint arXiv:2402.12226},
+  year={2024}
+}
+```
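The pipe-separated instruction format documented in this README (`source|target|content`, with an optional fourth field for the voice prompt in zero-shot TTS) can be illustrated with a small parser. This is a sketch for illustration only; it is not the parsing code used by `cli_infer_base_model.py`, which this diff does not show. `Instruction`, `MODALITIES`, and `parse_instruction` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional

# The four modalities named in the README.
MODALITIES = {"text", "image", "speech", "music"}


@dataclass
class Instruction:
    source: str
    target: str
    content: str
    voice_prompt: Optional[str] = None


def parse_instruction(line: str) -> Instruction:
    """Split 'source|target|content[|voice_prompt]' into its fields."""
    parts = line.strip().split("|", 2)
    if len(parts) < 3:
        raise ValueError("expected 'source|target|content'")
    source, target, rest = (p.strip() for p in parts)
    if source not in MODALITIES or target not in MODALITIES:
        raise ValueError(f"unknown modality in '{source}|{target}'")
    voice_prompt = None
    # Only text-to-speech accepts a trailing voice prompt (zero-shot TTS).
    if (source, target) == ("text", "speech") and "|" in rest:
        rest, voice_prompt = (p.strip() for p in rest.rsplit("|", 1))
    return Instruction(source, target, rest, voice_prompt)
```

For example, `parse_instruction("image|text|static/infer/image/cat.jpg")` yields an image-to-text instruction whose content is the image path, while a four-field `text|speech|...` line also fills `voice_prompt`.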