MAE pre-training of ViT-Base, ViT-Small, and ViT-Tiny models on 270K AffectNet images for static facial expression recognition (SFER).
MAE ViT-Base pre-training on 270K AffectNet with a single 3090 GPU:
python -m torch.distributed.launch main_pretrain.py \
--model mae_vit_base_patch16 \
--batch_size 32 \
--accum_iter 4 --mask_ratio 0.75 \
--blr 1.5e-4 \
--epochs 300 \
--warmup_epochs 40 --weight_decay 0.05 \
--data_path /data/tao/fer/dataset/AffectNetdataset/Manually_Annotated_Images \
--output_dir /path/to/./out_dir_base
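For reference, the standard MAE recipe derives the absolute learning rate from --blr with the linear scaling rule over the effective batch size. A minimal sketch of that arithmetic, using the values from the command above and assuming a single GPU (the repo's main_pretrain.py may differ in detail):

```python
# Linear lr scaling in the standard MAE recipe (sketch, values from the command above).
batch_size, accum_iter, world_size = 32, 4, 1
blr = 1.5e-4

eff_batch_size = batch_size * accum_iter * world_size    # 128 images per optimizer step
lr = blr * eff_batch_size / 256                          # absolute lr = 7.5e-05
print(f"effective batch size: {eff_batch_size}, absolute lr: {lr:.2e}")
```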
ViT-Small and ViT-Tiny follow the same MAE pre-training settings as ViT-Base, except for --model and --output_dir:
- ViT-Small: --model mae_vit_small_patch16 and --output_dir /path/to/./out_dir_small
- ViT-Tiny: --model mae_vit_tiny_patch16 and --output_dir /path/to/./out_dir_tiny
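All three MAE backbones use --mask_ratio 0.75: each 224x224 image is split into 196 patches (14x14 for patch size 16) and only a random 25% of them are passed to the encoder. A minimal sketch of this per-sample random masking (illustrative, not the repo's exact implementation):

```python
import torch

def random_masking(x, mask_ratio=0.75):
    """Per-sample random patch masking in the MAE style (sketch)."""
    N, L, D = x.shape                     # batch, number of patches, embedding dim
    len_keep = int(L * (1 - mask_ratio))  # 196 patches -> keep 49
    noise = torch.rand(N, L)              # random score per patch
    ids_shuffle = torch.argsort(noise, dim=1)   # low score = keep
    ids_keep = ids_shuffle[:, :len_keep]
    # Gather only the visible patches; these are all the encoder sees.
    return torch.gather(x, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))

tokens = torch.randn(2, 196, 768)         # dummy patch embeddings for ViT-Base
print(random_masking(tokens).shape)       # torch.Size([2, 49, 768])
```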
ConvNeXt V2-Base pre-training on 270K AffectNet with a single 3090 GPU:
python -m torch.distributed.launch main_pretrain_convnextv2.py \
--model convnextv2_base \
--batch_size 64 --update_freq 8 \
--blr 1.5e-4 \
--epochs 400 \
--warmup_epochs 40 \
--data_path /data/tao/fer/dataset/AffectNetdataset/Manually_Annotated_Images \
--output_dir /path/to/./out_dir_base_1
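Here --update_freq 8 plays the same role as --accum_iter above: gradients are accumulated over 8 mini-batches, so each optimizer step effectively sees 64 x 8 = 512 images. A self-contained toy sketch of that pattern (the repo's actual FCMAE training loop may differ):

```python
import torch
from torch import nn

# Gradient accumulation as implied by --update_freq 8 (toy example).
update_freq = 8
model = nn.Linear(16, 1)                       # stand-in for ConvNeXt V2 + FCMAE decoder
optimizer = torch.optim.AdamW(model.parameters(), lr=1.5e-4)

for it in range(32):                           # 32 mini-batches -> 4 optimizer steps
    x = torch.randn(64, 16)                    # one mini-batch of 64 samples
    loss = model(x).pow(2).mean()              # stand-in for the reconstruction loss
    (loss / update_freq).backward()            # scale so gradients average over the large batch
    if (it + 1) % update_freq == 0:
        optimizer.step()
        optimizer.zero_grad()
```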
Fine-tuning MAE ViT-Base on RAF-DB with a single GPU:
python -m torch.distributed.launch --nproc_per_node=1 main_rafdb.py \
--learning-rate 1e-5 \
--epoch 120 \
--model-name vit_base_fixedpe_patch16_224 \
--resume \
--checkpoint-whole checkpoint/vit-base-checkpoint-300.pth \
--mixup
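Fine-tuning starts from the MAE encoder weights only. A hedged sketch of how the pre-trained checkpoint can be loaded into a classification ViT (main_rafdb.py handles this internally; the key layout below assumes the standard MAE checkpoint format):

```python
import torch
import timm

# Initialize a ViT-Base classifier from the MAE pre-training checkpoint (sketch).
ckpt = torch.load("checkpoint/vit-base-checkpoint-300.pth", map_location="cpu")
state = ckpt.get("model", ckpt)

# Drop decoder weights and the mask token -- they are only used during pre-training.
state = {k: v for k, v in state.items()
         if not k.startswith("decoder") and k != "mask_token"}

model = timm.create_model("vit_base_patch16_224", num_classes=7)  # RAF-DB has 7 basic expressions
msg = model.load_state_dict(state, strict=False)
print(msg.missing_keys)  # the classification head is expected to be newly initialized
```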
Fine-tuning MAE ViT-Base on 270K AffectNet with a single 3090 GPU:
python -m torch.distributed.launch main_finetune_affectnet.py \
--model mae_vit_base_patch16 \
--batch_size 16 \
--accum_iter 2 \
--blr 5e-4 --layer_decay 0.65 --weight_decay 0.05 \
--drop_path 0.1 --reprob 0.25 --mixup 0.8 --cutmix 1.0 \
--epochs 10 \
--finetune '/path/out_dir_base_1/vit_base_checkpoint-299.pth'
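--layer_decay 0.65 applies layer-wise learning-rate decay: the last transformer block and the head get the largest learning rate, and each earlier block is scaled down by a factor of 0.65. A small sketch of the resulting schedule, following the common MAE/BEiT fine-tuning recipe (the exact parameter grouping in the repo may differ slightly):

```python
# Layer-wise lr decay for a 12-block ViT-Base (illustrative; --blr is additionally
# scaled by the effective batch size before use, as in the pre-training sketch).
depth = 12
base_lr = 5e-4
layer_decay = 0.65

for layer_id in range(depth + 1):          # 0 = patch/pos embeddings, depth = last block
    scale = layer_decay ** (depth - layer_id)
    print(f"layer {layer_id:2d}: lr scale {scale:.3f} -> lr {base_lr * scale:.2e}")
```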
Hint: a small number of training epochs is recommended on AffectNet to avoid overfitting its noisy labels.
In addition, data augmentation tricks such as horizontal flip, color jitter, affine transformation, random erasing, and mixup can significantly improve fine-tuning performance.
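One possible torchvision pipeline covering those tricks (parameter values are illustrative, not the repo's exact settings; mixup is applied per batch at training time, e.g. with timm.data.Mixup):

```python
from torchvision import transforms

# Example training-time augmentation for 224x224 face crops (illustrative values).
train_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),
    transforms.RandomAffine(degrees=10, translate=(0.05, 0.05), scale=(0.9, 1.1)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    transforms.RandomErasing(p=0.25),      # corresponds to --reprob 0.25
])
```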
ViT-Small and ViT-Tiny follow the same MAE fine-tuning settings as ViT-Base, except for --model and --finetune:
- ViT-Small: --model mae_vit_small_patch16 and --finetune /path/out_dir_small_1/vit_small_checkpoint-300.pth
- ViT-Tiny: --model mae_vit_tiny_patch16 and --finetune /path/out_dir_tiny_1/vit_tiny_checkpoint-300.pth
ConvNeXt V2-Base fine-tuning on RAF-DB with a single 3090 GPU:
python -m torch.distributed.launch main_finetune.py \
--model convnextv2_base \
--batch_size 32 --update_freq 4 \
--blr 6.25e-4 --epochs 100 --warmup_epochs 20 \
--layer_decay_type 'group' --layer_decay 0.6 --weight_decay 0.05 \
--drop_path 0.1 --reprob 0.25 \
--mixup 0.8 --cutmix 1.0 --smoothing 0.1 \
--model_ema True --model_ema_eval True \
--use_amp True \
--data_path /path/to/Dataset/RAF-DB/basic \
--finetune '/path/out_dir_base/convnextv2_base_checkpoint-320.pth'
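The --mixup 0.8 --cutmix 1.0 --smoothing 0.1 flags are typically wired up through timm. A hedged sketch of that pattern (main_finetune.py may pass additional arguments):

```python
import torch
from timm.data import Mixup
from timm.loss import SoftTargetCrossEntropy

# Batch-level mixup/cutmix with label smoothing, as implied by the flags above.
mixup_fn = Mixup(mixup_alpha=0.8, cutmix_alpha=1.0,
                 label_smoothing=0.1, num_classes=7)    # 7 expression classes on RAF-DB
criterion = SoftTargetCrossEntropy()                    # targets become soft after mixup

images = torch.randn(8, 3, 224, 224)                    # dummy batch (must be even-sized)
targets = torch.randint(0, 7, (8,))
images, soft_targets = mixup_fn(images, targets)        # mixed images, soft labels
print(soft_targets.shape)                               # torch.Size([8, 7])
```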
| name | resolution | RAF-DB Acc(%) | AffectNet-7 Acc(%) | AffectNet-8 Acc(%) | FERPlus Acc(%) | #params | model |
|---|---|---|---|---|---|---|---|
| MAE ViT-Base | 224x224 | 91.07 | 66.09 | 62.42 | 90.18 | 86.5M | model |
| MAE ViT-Small | 224x224 | 90.03 | 65.53 | 62.06 | 89.35 | 21.9M | model |
| MAE ViT-Tiny | 224x224 | 88.72 | 64.25 | 61.45 | 88.67 | 5.6M | model |
| ConvNeXt V2-B | 224x224 | 89.52 | - | - | - | 89M | model |
With data augmentation tricks, the accuracy of MAE ViT-Base increases to 91.79% on RAF-DB, 63.81% on AffectNet-8, and 90.82% on FERPlus.
Additional weights pre-trained for more epochs are available for ViT-Small (600 epochs) and ViT-Tiny (600/800/1000 epochs).
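A hedged sketch of running one of the released fine-tuned checkpoints for inference; the checkpoint filename below is hypothetical, and the key layout and class order should be checked against the actual release:

```python
import torch
import timm
from PIL import Image
from torchvision import transforms

# Load a fine-tuned ViT-Base classifier (sketch; "checkpoint/vit_base_rafdb.pth" is a placeholder name).
model = timm.create_model("vit_base_patch16_224", num_classes=7)
ckpt = torch.load("checkpoint/vit_base_rafdb.pth", map_location="cpu")
model.load_state_dict(ckpt.get("model", ckpt), strict=False)
model.eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
img = preprocess(Image.open("face.jpg").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    pred = model(img).argmax(dim=1)
print(pred.item())   # index into the dataset's 7 expression classes
```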
If you find this repo helpful, please consider citing:
@article{li2024emotion,
  title={Emotion separation and recognition from a facial expression by generating the poker face with vision transformers},
  author={Li, Jia and Nie, Jiantao and Guo, Dan and Hong, Richang and Wang, Meng},
  journal={IEEE Transactions on Computational Social Systems},
  year={2024},
  publisher={IEEE}
}

@article{chen2024static,
  title={From static to dynamic: Adapting landmark-aware image models for facial expression recognition in videos},
  author={Chen, Yin and Li, Jia and Shan, Shiguang and Wang, Meng and Hong, Richang},
  journal={IEEE Transactions on Affective Computing},
  year={2024},
  publisher={IEEE}
}