This repository contains the code for the Spoter embedding model explained in this blog post. The model is heavily based on SPOTER, which was presented in Sign Pose-Based Transformer for Word-Level Sign Language Recognition, with one of the main modifications being that this is an embedding model instead of a classification model. This enables several few-shot tasks on unseen Sign Language datasets from around the world. More details are given in the blog post mentioned above.
Modifications to SPOTER
Here is a list of the main modifications made to the SPOTER code and model architecture:
- The output layer is a linear layer but trained using triplet loss instead of CrossEntropyLoss. The output of the model is therefore an embedding vector that can be used for several downstream tasks.
- We started using the keypoints dataset published by Spoter but later created new datasets using BlazePose from Mediapipe (as is done in Spoter 2). This improves results considerably.
- We select batches so that they contain several hard triplets and then compute the loss on all hard triplets found in each batch (see the sketch after this list).
- Some code refactoring to accommodate the new classes we implemented.
- A minor code fix when using the rotate augmentation to avoid exceptions.
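For reference, the core idea behind the batch-hard triplet mining described above can be sketched as follows. This is a minimal PyTorch sketch, not the repo's exact implementation; the `margin` value is illustrative, and batches are assumed to contain at least one positive per anchor (which is what the batch selection above aims for).

```python
import torch

def batch_hard_triplet_loss(embeddings, labels, margin=0.2):
    """Minimal batch-hard triplet loss sketch (margin value is illustrative).

    embeddings: (batch, dim) embedding vectors produced by the model
    labels:     (batch,) integer class ids
    """
    # Pairwise Euclidean distances between all embeddings in the batch
    dists = torch.cdist(embeddings, embeddings, p=2)            # (batch, batch)

    same_class = labels.unsqueeze(0) == labels.unsqueeze(1)     # (batch, batch) bool
    eye = torch.eye(len(labels), dtype=torch.bool, device=labels.device)

    # Hardest positive: farthest embedding of the same class (excluding self)
    pos_dists = dists.masked_fill(~same_class | eye, 0.0)
    hardest_pos = pos_dists.max(dim=1).values

    # Hardest negative: closest embedding of a different class
    neg_dists = dists.masked_fill(same_class, float("inf"))
    hardest_neg = neg_dists.min(dim=1).values

    # Standard triplet hinge: positives should be closer than negatives by `margin`
    return torch.relu(hardest_pos - hardest_neg + margin).mean()
```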
We used the silhouette score to measure how well the clusters are defined during training. The silhouette score is high (close to 1) when the clusters of different classes are well separated from each other, and low (close to -1) in the opposite case. Our best model reached 0.7 on the train set and 0.1 on validation.
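For context, the silhouette score can be computed with scikit-learn directly on the embedding vectors. The snippet below uses random stand-in data and assumes a Euclidean metric:

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Stand-ins for the model outputs: one embedding vector per clip and its class id
embeddings = np.random.randn(200, 64)    # (n_samples, embedding_dim)
labels = np.random.randint(0, 10, 200)   # (n_samples,)

# Close to 1 when classes form tight, well-separated clusters; close to -1 otherwise
score = silhouette_score(embeddings, labels, metric="euclidean")
print(f"Silhouette score: {score:.2f}")
```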
While the model was not trained with classification specifically in mind, it can still be used for that purpose. Here we show top-1 and top-5 classification accuracy, calculated by taking the 1 (or 5) nearest vectors of different classes to the target vector.
To estimate the accuracy for LSA, we take a “train” set as given and then classify the holdout set based on the closest vectors from that “train” set. This is done using a model trained only on the WLASL100 dataset.
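A minimal sketch of this nearest-neighbor evaluation is shown below. Euclidean distance and the "k distinct classes" rule are assumptions; the repo's evaluation code may differ in details.

```python
import numpy as np

def top_k_accuracy(train_emb, train_labels, test_emb, test_labels, k=5):
    """Classify each test embedding by its k nearest distinct-class "train" embeddings."""
    hits = 0
    for emb, true_label in zip(test_emb, test_labels):
        dists = np.linalg.norm(train_emb - emb, axis=1)
        order = np.argsort(dists)
        # Walk the neighbours in order, keeping the first k *distinct* classes
        seen = []
        for idx in order:
            label = train_labels[idx]
            if label not in seen:
                seen.append(label)
            if len(seen) == k:
                break
        hits += true_label in seen
    return hits / len(test_labels)
```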
The recommended way of running code from this repo is by using Docker.
Clone this repository and run:
docker build -t spoter_embeddings .
docker run --rm -it --entrypoint=bash --gpus=all -v $PWD:/app spoter_embeddings
Running without specifying the entrypoint will train the model with the hyperparameters specified in train.sh.
If you prefer running in a virtual environment instead, then first install dependencies:
pip install -r requirements.txt
We tested this using Python 3.7.13. Other versions may work.
To train the model, run train.sh in Docker or your virtual env.
The hyperparameters with their descriptions can be found in the training/train_arguments.py file.
Same as with SPOTER, this model works on top of sequences of signers' skeletal data extracted from videos. This means that the input data has a much lower dimension than raw video, so the model is quicker and lighter, and you can choose any SOTA body pose model to preprocess the videos. This makes our model lightweight and able to run in real time (for example, it takes around 40 ms to process a 4-second 25 FPS video inside a web browser using onnxruntime).
For ready-to-use datasets, refer to the Spoter repository.
For best results, we recommend building your own dataset by downloading a Sign language video dataset such as WLASL and then using the extract_mediapipe_landmarks.py and create_wlasl_landmarks_dataset.py scripts to create a body keypoints dataset that can be used to train the Spoter embeddings model (a minimal MediaPipe extraction sketch follows the example commands below).
You can run these scripts as follows:
# This will extract landmarks from the downloaded videos
python3 preprocessing.py extract -videos <path_to_video_folder> --output-landmarks <path_to_landmarks_folder>
# This will create a dataset (csv file) with the first 100 classes, splitting 20% of it to the test set, and 80% for train
python3 preprocessing.py create -videos <path_to_video_folder> -lmks <path_to_landmarks_folder> --dataset-folder=<output_folder> --create-new-split -ts=0.2
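For illustration only, the snippet below shows the kind of per-frame BlazePose landmark extraction MediaPipe provides. It is not the repo's preprocessing script; the video path is a placeholder, and the actual scripts may also handle hand landmarks and dataset formatting.

```python
import cv2
import mediapipe as mp

# Minimal per-frame BlazePose landmark extraction with MediaPipe (placeholder video path)
cap = cv2.VideoCapture("example_sign.mp4")
frames_landmarks = []
with mp.solutions.pose.Pose(static_image_mode=False) as pose:
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        results = pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if results.pose_landmarks:
            # 33 BlazePose landmarks per frame, each with normalized x/y coordinates
            frames_landmarks.append(
                [(lm.x, lm.y) for lm in results.pose_landmarks.landmark]
            )
cap.release()
print(f"Extracted landmarks for {len(frames_landmarks)} frames")
```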
There are two Jupyter notebooks included in the notebooks folder.
- embeddings_evaluation.ipynb: This notebook shows how to evaluate a model
- visualize_embeddings.ipynb: Model embeddings visualization, optionally with embedded input video
The code supports tracking experiments, datasets, and models on a ClearML server. If you want to do this, make sure to pass the following arguments to train.py:
--dataset_loader=clearml
--tracker=clearml
Also make sure to correctly configure your clearml.conf file.
If using Docker, you can map it into the container by adding these volumes when running docker run:
-v $HOME/clearml.conf:/root/clearml.conf -v $HOME/.clearml:/root/.clearml
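For reference, this is roughly what ClearML experiment tracking looks like in plain Python; the repo's --tracker=clearml option wires this up for you. The project/task names and values below are placeholders.

```python
from clearml import Task

# Placeholder project/task names; in this repo the tracker integration handles this internally
task = Task.init(project_name="spoter-embeddings", task_name="triplet-training")
task.connect({"learning_rate": 1e-4, "batch_size": 64})  # log hyperparameters (illustrative values)

# ... training loop ...
task.get_logger().report_scalar("loss", "train", value=0.42, iteration=1)
```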
Follow these steps to convert your model to ONNX, TF or TFlite:
- Install the additional dependencies listed in conversion_requirements.txt. This is best done inside the Docker container.
- Run python convert.py -c <PATH_TO_PYTORCH_CHECKPOINT>. Add -tf if you want to export TensorFlow and TFlite models too.
- The output models should be generated in a folder named converted_models.
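Once converted, the ONNX model can be run with onnxruntime. The sketch below uses a placeholder model file name and a dummy input whose shape is illustrative only; the real layout depends on the keypoint format of your dataset.

```python
import numpy as np
import onnxruntime as ort

# Placeholder model path inside the converted_models output folder
session = ort.InferenceSession("converted_models/spoter_embeddings.onnx")
input_meta = session.get_inputs()[0]
print(input_meta.name, input_meta.shape)

# Dummy batch: 100 frames of flattened 2D keypoints (shape is illustrative only)
dummy = np.random.rand(1, 100, 108).astype(np.float32)
embedding = session.run(None, {input_meta.name: dummy})[0]
print("Embedding shape:", embedding.shape)
```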
You can test your model's performance in a web browser. Check out the README in the web folder.
The code is published under the Apache License 2.0, which allows both academic and commercial use provided that the relevant license and copyright notice are included, our work is cited, and all changes are stated.
The license for the WLASL and LSA64 datasets used for the experiments is, however, the Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license, which allows only non-commercial usage.
If you use this code in your research, please cite us:
@misc{xmartlabs-2023-spoterembeddings,
author = {Pablo Grill and Gabriel Lema and Andres Herrera and Mathias Claassen},
title = {{SpoterEmbeddings}: Create embeddings from sign pose videos using {T}ransformers},
year = {2023},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/xmartlabs/spoter-embeddings}}
}