
Commit cf75d30

AudioToken code
1 parent 6b15276 commit cf75d30

33 files changed: +203,725 -1 lines changed

.idea/.gitignore
Lines changed: 8 additions & 0 deletions

.idea/AudioToken.iml
Lines changed: 12 additions & 0 deletions

.idea/inspectionProfiles/Project_Default.xml
Lines changed: 21 additions & 0 deletions

.idea/inspectionProfiles/profiles_settings.xml
Lines changed: 6 additions & 0 deletions

.idea/misc.xml
Lines changed: 4 additions & 0 deletions

.idea/modules.xml
Lines changed: 8 additions & 0 deletions

.idea/vcs.xml
Lines changed: 6 additions & 0 deletions

README.md

Lines changed: 91 additions & 1 deletion
@@ -1 +1,91 @@
-# AudioToken

# Adaptation of Text-Conditioned Diffusion Models for Audio-to-Image Generation

This repo contains the official PyTorch implementation of *Adaptation of Text-Conditioned Diffusion Models for Audio-to-Image Generation*.

# Abstract

In recent years, image generation has shown a great leap in performance, where diffusion models play a central role. Although generating high-quality images, such models are mainly conditioned on textual descriptions. This begs the question: how can we adapt such models to be conditioned on other modalities? In this paper, we propose a novel method utilizing latent diffusion models, trained for text-to-image generation, to generate images conditioned on audio recordings. Using a pre-trained audio encoding model, the proposed method encodes audio into a new token, which can be considered as an adaptation layer between the audio and text representations. Such a modeling paradigm requires a small number of trainable parameters, making the proposed approach appealing for lightweight optimization. Results suggest the proposed method is superior to the evaluated baseline methods considering both objective and subjective metrics.

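To make the adaptation-layer idea concrete, here is a minimal PyTorch sketch of how a pooled audio embedding could be projected into a single pseudo-token in the text encoder's embedding space. This is our illustration, not the repository's actual code: the class name, dimensions, and pooling choice are assumptions.

```python
import torch
import torch.nn as nn

class AudioTokenProjector(nn.Module):
    """Hypothetical adapter: maps frame-level features from a frozen
    audio encoder (e.g. BEATs) to one token in the text encoder's
    embedding space."""

    def __init__(self, audio_dim: int = 768, text_dim: int = 768):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(audio_dim, text_dim),
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
        )

    def forward(self, audio_features: torch.Tensor) -> torch.Tensor:
        # audio_features: (batch, frames, audio_dim)
        pooled = audio_features.mean(dim=1)  # simple temporal pooling
        return self.proj(pooled)             # (batch, text_dim) "audio token"

# Only the small projector would be trained; the diffusion model, text
# encoder, and audio encoder all stay frozen, which is what keeps the
# number of trainable parameters low.
features = torch.randn(2, 50, 768)  # stand-in for audio-encoder output
audio_token = AudioTokenProjector()(features)
print(audio_token.shape)  # torch.Size([2, 768])
```
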
# News
### May 17, 2023
Weights for [LoRA](https://huggingface.co/blog/lora) were added. With only 3.3 MB of weights, performance is pushed to a higher level. The weights are stored in ```output/weights/lora_layers_learned_embeds.bin```.

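For readers unfamiliar with LoRA, the sketch below shows the low-rank-update idea in a few lines of PyTorch; the width and rank are made-up example values, chosen only to illustrate why such a checkpoint stays in the megabyte range.

```python
import torch

# LoRA in a nutshell (illustrative shapes, not this repo's actual config):
# a frozen weight W is adapted by a low-rank product B @ A, so only the
# two small factors need to be trained and stored.
d, r = 768, 4                  # model width and LoRA rank (assumed)
W = torch.randn(d, d)          # frozen base weight: d*d parameters
A = torch.randn(r, d) * 0.01   # trained factor: r*d parameters
B = torch.zeros(d, r)          # trained factor: d*r parameters
W_adapted = W + B @ A          # update applied on top of the frozen weight

# The two factors hold 2*d*r = 6,144 values here versus d*d = 589,824
# in the full matrix, which is why LoRA checkpoints are tiny.
print(W_adapted.shape)         # torch.Size([768, 768])
```
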
# Installation
```
git clone git@github.com:guyyariv/AudioToken.git
cd AudioToken
pip install -r requirements.txt
```
And initialize an Accelerate environment with:
```
accelerate config
```
Download the BEATs pre-trained model:
```
mkdir -p models/BEATs/ && wget "https://msranlcmtteamdrive.blob.core.windows.net/share/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt?sv=2020-08-04&st=2022-12-18T10%3A41%3A16Z&se=3022-12-19T10%3A41%3A00Z&sr=b&sp=r&sig=gSSExKP0otwVBgKwdV8FoMWL2VppARFq%2B26xKin5rKw%3D" -P "models/BEATs/"
```

# Pre-Trained Embedder

![alt text](https://github.com/guyyariv/AudioToken/blob/master/figs/fig1.png)

The pre-trained embedder weights on which the paper's results are based can be found at:
```output/embedder_learned_embeds.bin```

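As a quick sanity check after obtaining the file, the checkpoint can be opened with torch.load. The snippet below is a hedged example; the assumption that the file holds a flat dict of named tensors is ours, not something the repo documents.

```python
import torch

# Hypothetical inspection of the shipped checkpoint; the exact key
# layout inside the .bin file is an assumption, not documented here.
state = torch.load("output/embedder_learned_embeds.bin", map_location="cpu")

if isinstance(state, dict):
    for name, value in state.items():
        shape = tuple(value.shape) if hasattr(value, "shape") else type(value)
        print(f"{name}: {shape}")
else:
    print(type(state))  # fall back to showing what was actually saved
```
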
# Training

First, download the [VGGSound](https://www.robots.ox.ac.uk/~vgg/data/vggsound/) dataset.

```
export MODEL_NAME="CompVis/stable-diffusion-v1-4"
export DATA_DIR="./vggsound/"
export OUTPUT_DIR="output/"

accelerate launch train.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --train_data_dir=$DATA_DIR \
  --output_dir=$OUTPUT_DIR \
  --resolution=512 \
  --train_batch_size=4 \
  --gradient_accumulation_steps=4 \
  --max_train_steps=30000 \
  --learning_rate=1.0e-05 --scale_lr \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0
```
Note: Change the resolution to 768 if you are using the [stable-diffusion-2](https://huggingface.co/stabilityai/stable-diffusion-2) 768x768 model.
# Inference

After you've trained a model with the above command, you can generate images using the following script:
```
export MODEL_NAME="CompVis/stable-diffusion-v1-4"
export DATA_DIR="./vggsound/"
export OUTPUT_DIR="output/"
export LEARNED_EMBEDS="output/embedder_learned_embeds.bin"

accelerate launch inference.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --test_data_dir=$DATA_DIR \
  --output_dir=$OUTPUT_DIR \
  --learned_embeds=$LEARNED_EMBEDS
```

# Cite

# License
This repository is released under the MIT license as found in the [LICENSE](LICENSE) file.

constants/best_frames.json

Lines changed: 1 addition & 0 deletions

constants/low_quality_videos.pkl

467 KB binary file (contents not shown)
