Speech-Driven Expression Blendshape Based on Single-Layer Self-attention Network

HelloWorld: 9th

Our final results are as follows.

result.mp4

For more contest details, please refer to the official website.

1. Data Process

The data provided here is not the complete dataset. Due to copyright issues, please request the full dataset on the official website.

Distribution of speech frame lengths:

Install Montreal Forced Aligner and download the Mandarin models:

conda create -n aligner -c conda-forge montreal-forced-aligner python=3.8
conda activate aligner
conda config --add channels conda-forge
conda install montreal-forced-aligner
mfa model download acoustic mandarin_mfa
mfa model download dictionary mandarin_mfa
mfa model inspect acoustic mandarin_mfa      # View the acoustic model

Then install the dependencies and run the processing script:

pip install -r requirements1.txt
python data_process/process.py

2. Train model

2.1 Dependencies

Python 3.7

conda create -n AIWIN python=3.7
conda activate AIWIN
cd <path to your project>
pip install -r requirements.txt

2.2 Make lmdb data

python My/scripts/aiwin_dataset_to_lmdb.py ./data

Copy the output and paste it into Tri/config/multimodal_context.yml, for example:

data_mean: [0.07876, 0.00280, 0.01174, 0.18354, 0.10486, 0.16363, 0.10860, 0.00205, 0.01784, 0.22835, 0.22417, 0.00615, 0.00558, 0.06443, 0.06593, 0.18330, 0.17782, 0.06199, 0.04290, 0.04572, 0.19684, 0.03967, 0.03928, 0.29169, 0.29800, 0.05240, 0.04886, 0.17750, 0.17757, 0.09945, 0.00002, 0.00002, 0.01264, 0.12944, 0.12708, 0.08526, 0.08594]
data_std: [0.04059, 0.00566, 0.01210, 0.11373, 0.09498, 0.11489, 0.10505, 0.01894, 0.02042, 0.14744, 0.14685, 0.01871, 0.01970, 0.02078, 0.02137, 0.05933, 0.05742, 0.04199, 0.04282, 0.02453, 0.08287, 0.00617, 0.00626, 0.16426, 0.16778, 0.02844, 0.02643, 0.04753, 0.04750, 0.04343, 0.00047, 0.00047, 0.00456, 0.04094, 0.04016, 0.01510, 0.01562]
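As a rough illustration of where such values could come from, the sketch below computes a per-channel mean and standard deviation over all frames and prints them in the same format. It is not the code in aiwin_dataset_to_lmdb.py; the function names, the random stand-in data, and the way the statistics are applied in normalize are assumptions made for this example. Only the channel count (37) is taken from the lists above.

```python
import numpy as np


def channel_stats(sequences):
    """Per-channel mean and std over all frames of all sequences."""
    frames = np.concatenate(sequences, axis=0)  # (total_frames, n_channels)
    return frames.mean(axis=0), frames.std(axis=0)


def to_yaml_line(key, values):
    """Format values as a one-line YAML list with five decimals."""
    return f"{key}: [" + ", ".join(f"{v:.5f}" for v in values) + "]"


def normalize(x, mean, std, eps=1e-8):
    """Standardize frames; eps guards channels that are almost constant."""
    return (x - mean) / np.maximum(std, eps)


def denormalize(x, mean, std, eps=1e-8):
    """Inverse of normalize, for mapping model output back to blendshapes."""
    return x * np.maximum(std, eps) + mean


if __name__ == "__main__":
    # Random stand-in data: two clips with 37 blendshape channels each.
    rng = np.random.default_rng(0)
    clips = [rng.random((120, 37)), rng.random((80, 37))]
    mean, std = channel_stats(clips)
    print(to_yaml_line("data_mean", mean))
    print(to_yaml_line("data_std", std))
```

Some channels in the lists above have a standard deviation close to zero (0.00047), which is why the sketch clamps the divisor.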

If you encounter the following error:

(AIWIN) [yangsc21@mjrc-server11 AIWIN]$ python My/scripts/aiwin_dataset_to_lmdb.py ./data
Traceback (most recent call last):
  File "My/scripts/aiwin_dataset_to_lmdb.py", line 7, in <module>
    import pyarrow
  File "/ceph/home/yangsc21/anaconda3/envs/AIWIN/lib/python3.7/site-packages/pyarrow/__init__.py", line 49, in <module>
    from pyarrow.lib import cpu_count, set_cpu_count
ImportError: libcrypt.so.1: cannot open shared object file: No such file or directory

Locate the library with whereis libcrypt.so.1, link it with ln /usr/lib/libcrypt.so libcrypt.so.1, and then extend the library search path:

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:"/usr/lib/libcrypt.so.1.0"
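To confirm whether the fix took effect before rerunning the script, a small check like the following may help. It is a sketch that uses only the standard ctypes module; the helper name can_load is made up for this example.

```python
import ctypes
import ctypes.util


def can_load(library_name):
    """Return True if the dynamic loader can open library_name."""
    try:
        ctypes.CDLL(library_name)
        return True
    except OSError:
        return False


if __name__ == "__main__":
    # find_library returns a file name such as 'libcrypt.so.1', or None.
    print("find_library('crypt'):", ctypes.util.find_library("crypt"))
    print("libcrypt.so.1 loadable:", can_load("libcrypt.so.1"))
```

If the second line prints False, import pyarrow is likely to fail with the same ImportError.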

2.3 Pretrained Model

mkdir <your_home_dir>/chinese-hubert-large

Download TencentGameMate/chinese-hubert-large from here and place it in <your_home_dir>/chinese-hubert-large.
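A quick sanity check of the downloaded folder could look like the sketch below. It assumes the checkpoint is in Hugging Face format and is loaded with the transformers class HubertModel; how the training code in this repository actually loads it is not shown here, and both helper names are hypothetical.

```python
from pathlib import Path


def missing_model_files(model_dir, required=("config.json",)):
    """List the required files that are absent from model_dir."""
    model_dir = Path(model_dir).expanduser()
    return [name for name in required if not (model_dir / name).is_file()]


def load_hubert(model_dir):
    """Load the checkpoint with transformers (assumed to be installed)."""
    # Imported lazily so the directory check works without transformers.
    from transformers import HubertModel

    return HubertModel.from_pretrained(str(Path(model_dir).expanduser()))


if __name__ == "__main__":
    model_dir = "~/chinese-hubert-large"
    print("missing files:", missing_model_files(model_dir))
```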

2.4 Train

cd Tri/scripts
python train.py --config=<..your path/Tri/config/multimodal_context.yml>

2.5 Inference

python synthesize.py --ckpt_path "... your path/result/output_myfastdtw_batchfist_interpolate_normalize_dropout_data_decoder_val3_5_4_onehot/train_multimodal_context/multimodal_context_checkpoint_326.bin" --transcript_path "... your path/data/val/tsv/A10.tsv" --wav_path "... your path/data/val/wav/A10.wav"

3. PostProcess

Modify paths in data_process.

Run postprocess.py to smooth the output.
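For readers who want an idea of what such smoothing can look like, here is a minimal sketch using a centered moving average over time. The filter type and window size are assumptions for illustration; postprocess.py may use a different filter.

```python
import numpy as np


def smooth(frames, window=5):
    """Centered moving average along time for a (n_frames, n_channels) array."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    pad = window // 2
    # Repeat the first and last frame so the output keeps the input length.
    padded = np.pad(frames, ((pad, pad), (0, 0)), mode="edge")
    kernel = np.ones(window) / window
    channels = [
        np.convolve(padded[:, c], kernel, mode="valid")
        for c in range(frames.shape[1])
    ]
    return np.stack(channels, axis=1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    noisy = rng.random((100, 37))
    print(smooth(noisy, window=5).shape)
```

A larger window removes more jitter but also flattens fast mouth movements, so the value is a trade-off.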

Run postprocess_3.py to deflate the output.

Run postprocess_2.py to perform a weighted average of the results from multiple models.
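The weighted average can be sketched as follows. The function name, the example weights, and the choice to truncate all predictions to the shortest one are assumptions for this example rather than a description of postprocess_2.py.

```python
import numpy as np


def weighted_average(predictions, weights):
    """Blend (n_frames, n_channels) predictions from several models."""
    if len(predictions) != len(weights):
        raise ValueError("need exactly one weight per prediction")
    w = np.asarray(weights, dtype=float)
    if w.sum() <= 0:
        raise ValueError("weights must sum to a positive value")
    # Assumption: if lengths differ, cut every prediction to the shortest.
    n_frames = min(p.shape[0] for p in predictions)
    stacked = np.stack([p[:n_frames] for p in predictions], axis=0)
    # Contract the model axis with the normalized weights.
    return np.tensordot(w / w.sum(), stacked, axes=1)


if __name__ == "__main__":
    a = np.full((4, 37), 0.2)
    b = np.full((4, 37), 0.6)
    print(weighted_average([a, b], [3, 1])[0, 0])
```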

Run add_eye to add suitable eye expressions taken from the training and validation sets. Method: among the csv files in the training and validation sets whose frame count is greater than or equal to that of the generated Blendshape, find the one whose frame count is closest, then truncate its eye action to the frame count of the generated Blendshape and use it as the added eye action. The frame counts in the training and validation sets are shown in Frame count in the training and validation sets.png.
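The eye-action selection described above can be sketched roughly as below. The function names and the eye_columns argument are hypothetical; which of the blendshape channels are eye channels depends on the dataset and is not specified here.

```python
import numpy as np


def pick_eye_source(frame_counts, target_frames):
    """Return the clip with the fewest frames that still has >= target_frames.

    frame_counts maps a clip name to its number of frames.
    Returns None if no clip is long enough.
    """
    candidates = {k: n for k, n in frame_counts.items() if n >= target_frames}
    if not candidates:
        return None
    return min(candidates, key=candidates.get)


def add_eye(generated, source, eye_columns):
    """Copy the eye channels of source, cut to the generated length."""
    out = generated.copy()
    n_frames = generated.shape[0]
    out[:, eye_columns] = source[:n_frames, eye_columns]
    return out


if __name__ == "__main__":
    counts = {"A1": 100, "A2": 130, "A3": 120}
    print(pick_eye_source(counts, 110))
```

Here the generated clip has 110 frames, so A3 (120 frames) is chosen over A2 (130 frames), and A1 is too short to be considered.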

4. Model performance

Average time to process 1 s of audio: 0.025 s
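A figure of this kind can be measured with a small timing helper such as the one below. This is a sketch, not the script used to obtain the number above; run_inference stands for any callable that processes one clip and is an assumption of this example.

```python
import time


def seconds_per_audio_second(run_inference, audio_seconds, repeats=3):
    """Average wall-clock time spent per second of input audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    start = time.perf_counter()
    for _ in range(repeats):
        run_inference()
    elapsed = time.perf_counter() - start
    return elapsed / repeats / audio_seconds


if __name__ == "__main__":
    # Stand-in workload; replace with a call into the real model.
    ratio = seconds_per_audio_second(lambda: sum(range(100000)), 10.0)
    print(f"{ratio:.6f} s per 1 s of audio")
```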

5. Visualization

Download an .fbx model (e.g. by iPhone), then use blender.py, which runs in Blender, to visualize the .csv blendshape file like this:

Visualization.mp4
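For orientation, driving shape keys from a csv file in Blender can be sketched as follows. This is not the content of blender.py. It assumes that the csv has a header row whose column names match the shape key names of the imported mesh, and the object name passed in is hypothetical. The second function only works inside Blender, where the bpy module is available.

```python
import csv


def read_blendshape_csv(path):
    """Return (names, rows): header names and one list of floats per frame."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        names = next(reader)
        rows = [[float(v) for v in row] for row in reader if row]
    return names, rows


def keyframe_shape_keys(object_name, names, rows):
    """Keyframe every shape key whose name matches a csv column."""
    import bpy  # only available in Blender's bundled Python

    key_blocks = bpy.data.objects[object_name].data.shape_keys.key_blocks
    for frame, row in enumerate(rows, start=1):
        for name, value in zip(names, row):
            block = key_blocks.get(name)
            if block is not None:  # skip columns without a matching shape key
                block.value = value
                block.keyframe_insert(data_path="value", frame=frame)
```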

The final video rendering (like 申䒕雅) used for the subjective evaluation is generated by the organizer from the .csv blendshape file. It is not shown here due to copyright issues.

6. Conclusion

As you can see, our model is fairly simple; careful analysis and processing of the data alone brings a significant improvement in the results. Please feel free to contact me ([email protected]) with any questions or concerns.

7. [new!] Pretrained model

Please see ./result/.
