Keras implementation of YOLO v3 for object detection with training and deployment in Azure ML.

A Keras Implementation of YOLO v3 for Custom Model Training with Azure Machine Learning

Keras is a deep learning framework that acts as a high-level API over lower-level frameworks such as TensorFlow and CNTK. Azure Machine Learning, an ML platform integrated with Microsoft Azure for data preparation, experimentation, and model deployment, is exposed through a Python SDK (used here) and an extension to the Azure CLI. Together they are used to train a custom Keras object detection model with the code and instructions in this repo. The ML practitioner brings their own custom data to this process, so any object detector can be trained by following the steps below.

This work is based on:

YOLO stands for "you only look once" and is an efficient algorithm for object detection. (The repo's README includes an example image of results from a trained car detector.)

Important papers on YOLO:

  • "You Only Look Once: Unified, Real-Time Object Detection" (Redmon et al., 2016)
  • "YOLO9000: Better, Faster, Stronger" (Redmon and Farhadi, 2017)
  • "YOLOv3: An Incremental Improvement" (Redmon and Farhadi, 2018)

There are "tiny" versions of the architecture, often considered for embedded/constrained devices.

Website: https://pjreddie.com/darknet/yolo/ (provides information on a framework called Darknet)

This implementation of YOLOv3 (Tensorflow backend) was inspired by allanzelener/YAD2K.


What You Do Here

  1. Provision required resources in Azure
    • Azure ML Workspace
    • Azure Storage Account
    • Service Principal authentication for Workspace
  2. Install prerequisite libraries
  3. Register base model to Azure ML Workspace
  4. Upload images or video to Storage
  5. Label data with VoTT and export to Storage
  6. Use driver Python script to train a model in the cloud
  7. Download final model
  8. Perform inference on video from camera locally
  9. Deploy the solution locally and in the cloud as a REST API
  10. Convert from Keras to ONNX format

Note: The training script automatically calculates the optimal sizes for the anchor boxes and updates a config file for YOLOv3. It also uses the config file to convert a pretrained Darknet model to Keras format for the custom number of classes so the user does not need to perform these manual steps.

Prerequisites

  1. Azure Subscription (a free trial is available)
  2. Python 3.6+ installed with pip
  3. Visual Object Tagging Tool
  4. Azure CLI
  5. Command prompt or terminal

System setup

The user's host machine (or VM, if using one) for development (Windows, macOS, or Linux) is referred to as the "local" or "dev" machine. The machine or compute that trains the Azure ML models is referred to as the "target compute". There is also the notion of a deployment machine; for Azure ML this can be a local machine, an Azure VM, an Azure Container Instance, an Azure Kubernetes Service cluster, or one of many other deployment target options.

In this repo, we work with both the local and the target machine setups (illustrated in a diagram in the repo's README). There are environment variables and Python packages that differ in each setup. For example, Azure Storage related environment variables are set on the local or user dev machine and used through the command prompt or terminal.

IMPORTANT: If on Windows, please use the terminal window or command prompt rather than a PowerShell window for all of the subsequent commands.

Provision required resources in Azure

  1. Log in to Azure with the Azure CLI using your Azure credentials on the command line (this will also involve an interactive browser experience) - az login (additionally, you may need to switch to the appropriate subscription with az account set --subscription <subscription id>)
  2. Create an Azure ML Workspace (this sample works with both Basic and Enterprise)
  • Download the config.json from the Azure ML Workspace in the Azure Portal and place it in the project/.azureml folder. When using this file, authentication will need to happen interactively (vs. through a Service Principal).
  3. Create an Azure Storage Account

  4. Create a Service Principal.

    • A Service Principal is the recommended way for an unattended script or production system to authenticate with Azure ML and access a Workspace.

IMPORTANT: if you already have a Service Principal that you wish to use, you will still need to associate it with your Azure ML Workspace with the instructions in the link above.
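
As a point of reference, both authentication paths look roughly like the following with the Azure ML Python SDK (a hedged sketch only; the repo's scripts construct the Workspace handle themselves, and the placeholder values are yours to fill in).

# Sketch: obtaining an Azure ML Workspace handle two ways (illustrative, not the repo's exact code)
from azureml.core import Workspace
from azureml.core.authentication import ServicePrincipalAuthentication

# Interactive authentication via the downloaded config.json in project/.azureml
ws = Workspace.from_config(path="project/.azureml/config.json")

# Or unattended authentication with a Service Principal
sp_auth = ServicePrincipalAuthentication(
    tenant_id="<Tenant ID>",
    service_principal_id="<Client ID>",
    service_principal_password="<Client Password>")
ws = Workspace(subscription_id="<Subscription ID>",
               resource_group="<Resource Group>",
               workspace_name="<Azure ML Workspace Name>",
               auth=sp_auth)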

Install prerequisite libraries

Create a Python virtual environment or Anaconda environment for this project (highly recommended).

Use the Python package manager to install the Azure ML SDK. Ensure you are using the intended pip (sometimes it is pip3). It is strongly recommended to use a virtual environment or conda environment for this project, as very specific versions of packages are used and an isolated environment makes debugging the local setup easier.

pip install azureml-sdk==1.5.0

You will also need the Python Azure Blob Storage package installed locally or wherever you run the raw data uploading Python script.

Note: take note of the version number here (you may have to uninstall azure-storage-blob first if an older version is already installed).

pip install azure-storage-blob==12.3.1

Register base model to Azure ML Workspace

Download either the full-sized YOLO v3 or the tiny version.

Download full-sized YOLOv3 here: https://pjreddie.com/media/files/yolov3.weights.

Or download the tiny YOLOv3 here: https://pjreddie.com/media/files/yolov3-tiny.weights.

Then, place the model in the base folder of the cloned repository.

Run the following script to register the YOLO model to the Azure ML Workspace in the cloud (it uploads it in the background as well). --model-size should be either full or tiny.

python register_local_model_base.py --model-size tiny
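
Under the hood, registration amounts to a Model.register call with the Azure ML SDK, roughly as in the hedged sketch below (register_local_model_base.py is the supported path; the names are illustrative).

# Sketch of base model registration (illustrative; the repo's script handles this for you)
from azureml.core import Workspace
from azureml.core.model import Model

ws = Workspace.from_config(path="project/.azureml/config.json")
model = Model.register(workspace=ws,
                       model_path="yolov3-tiny.weights",     # local weights file
                       model_name="yolov3-tiny.weights",     # name shown in the Workspace
                       description="Base Darknet YOLO v3 tiny weights")
print(model.name, model.version)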

Upload images or video to Storage

Define local environment variables as follows so that the upload script sends data to the right Azure Storage Account and container (the raw data is a single folder with all of the raw images to label; it will serve as input to the labeling process with VoTT in the next step).

Windows

Create a setenvs.cmd file with the following (do not include the <> characters) and use the command prompt instead of PowerShell for the set command to work.

set STORAGE_CONTAINER_NAME_RAWDATA=<Blob container name to store the raw image data before labeling>
set STORAGE_CONTAINER_NAME_TRAINDATA=<Blob container name for labeled data for training with Azure ML>
set STORAGE_ACCOUNT_NAME=<Storage account name>
set STORAGE_ACCOUNT_KEY=<Storage account key>
set STORAGE_CONNECTION_STRING=<Storage account connection string for the upload script>

Run this setenvs.cmd on the command line in Windows and use the same terminal to run subsequent scripts. Alternatively, each "set" command can be run on the command line separately - just make sure to use the same terminal window later on.

Linux/MacOS

Create a .setenvs file (or use any name) with the following (do not include the <> characters; on Unix the STORAGE_CONNECTION_STRING value does need the quotation marks due to special characters).

export STORAGE_CONTAINER_NAME_RAWDATA=<Blob container name to store the raw image data before labeling>
export STORAGE_CONTAINER_NAME_TRAINDATA=<Blob container name for labeled data for training with Azure ML>
export STORAGE_ACCOUNT_NAME=<Storage account name>
export STORAGE_ACCOUNT_KEY=<Storage account key>
export STORAGE_CONNECTION_STRING="<Storage account connection string for the upload script>"

Set these environment variables in the terminal window with the command: source .setenvs. Make sure to use the same terminal window later on for running subsequent scripts.

Check that the environment variables were set correctly in Unix environments with the following command.

env | more

Create a local folder of raw images to be labeled and place all images into this folder. Upload that folder to Azure Blob Storage and a specific container with the following script. This container will then be accessed by the labeling tool, VoTT, in the next step.

If the images are in the data folder, for example, the script would be run as follows in the same terminal window in which the Storage-related environment variables were set.

python upload_to_blob.py --dir data

Note: the default access level is "Private (no anonymous access)" and can be changed with the SDK, as well as in the Azure Portal, by navigating to the container within the Storage account and selecting "Change access level".
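
For reference, the upload step boils down to a few azure-storage-blob v12 calls, roughly as in this hedged sketch (upload_to_blob.py is the supported path; names here are illustrative).

# Sketch of a raw-image folder upload with azure-storage-blob v12 (illustrative)
import os
from azure.storage.blob import BlobServiceClient

conn_str = os.environ["STORAGE_CONNECTION_STRING"]
container = os.environ["STORAGE_CONTAINER_NAME_RAWDATA"]

service = BlobServiceClient.from_connection_string(conn_str)
container_client = service.get_container_client(container)

for fname in os.listdir("data"):
    path = os.path.join("data", fname)
    if not os.path.isfile(path):
        continue
    with open(path, "rb") as f:
        # Blob name mirrors the local file name; overwrite any existing blob
        container_client.upload_blob(name=fname, data=f, overwrite=True)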

Label data with VoTT and export to Storage

VoTT is the labeling tool used locally to label data stored in the cloud; it writes labels directly back to a cloud store using a SAS token for REST authentication. The tool imports from and exports directly to the Azure Blob Storage containers specified while using the tool, with no need to download data for labeling. The raw images to label should already exist in a private Blob Storage container (the data from the previous step).

Note: The raw image container will form the "Source Connection" and a new container in that Storage Account will form the "Target Connection."

Use the VoTT 2 labeling tool to create and label bounding boxes and export the results. The Readme on the VoTT project page has good instructions, but the following should help clarify a few points.

The Source Connection and Target Connection (associated with two different containers) from and to Blob Storage will use a SAS token for authentication. Create the "SAS" or shared access signature in the Azure Portal in the Storage Account under the "Shared access signature" blade and set the permissions for "Service", "Container" and "Object".

Create it to expire whenever it makes sense for your scenario.

For the VoTT connection, use the "SAS token" from the Azure Portal (it should start with a ?).

IMPORTANT: A word of caution in using VoTT v2.1.0 - if adding more raw data to the input Storage container/connection to label some more images, make sure to keep the same output container/connection for the labels, otherwise the old labels may not transfer over. Also, ensure not to delete any files in the output storage container as it would interfere with current labeling (labels may disappear).

There will be one <random guid>-asset.json file per image in the base of the Blob Storage container (target connection) containing the saved labels for VoTT. Do not delete any of these files; otherwise, when opening the project again, the labels will also have been removed in the UI and cannot be recovered.

Once you are done labeling, go to the Export settings and select Pascal VOC. Then use the export button at the top of the app to export to the VoTT Target Connection defined earlier.

Your output in the cloud will have one folder with three subfolders in the new Storage container (the connection name used with VoTT). Check in the Azure Portal that this is the case.

The structure should look similar to the following, but with your name for the project (there will be many other files, the something-asset.json files, in the output Storage container along with this folder).

/ProjectName-PascalVOC-export
  /Annotations
  /ImageSets
  /JPEGImages

The annotations and images in this Storage container will then be used in the training script (mounted by Azure ML as a Data Store).

IMPORTANT: It is good to check the "access level" of the exported images and labels in the Azure Portal (the Blob Storage container used in the connection); ensure it has the "Private" access level if you do not wish the data to be public. To change it, navigate to the container and select "Change access level".

Finally, to get back to the "Cloud Project" in VoTT, simply open VoTT 2, select "Open Cloud Project", choose the "target" (output) connection, and select the .vott file (this is stored in the Blob Storage container).

Use driver Python script to train a model in the cloud

In order to use non-interactive authentication on the compute target, create myenvs.txt with the Service Principal information and place it in the project folder. Remember, the project folder and its content are uploaded to the Azure ML compute target of your choosing and this is where these variables are utilized. Create this file (it must be called myenvs.txt) with the following filled in with your information (do not include the <> characters). This information should have been obtained from the Service Principal creation step in the Provision required resources in Azure section.

AML_TENANT_ID=<Tenant ID>
AML_PRINCIPAL_ID=<Client ID>
AML_PRINCIPAL_PASS=<Client Password>
SUBSCRIPTION_ID=<Subscription ID>
WORKSPACE_NAME=<Azure ML Workspace Name>

Define the class names in a file called custom_classes.txt and place it in the project folder, with each class on a separate line, as in the following 2-class example file.

object
no_object

The training script, project/train_azureml.py, does the following.

  1. Calculates anchor box sizes
  2. Creates the proper config file (proper filter, anchor and class numbers)
  3. Converts the YOLO v3 Darknet weights to Keras weights
  4. Trains the YOLO model
  • Tracking the loss with the Azure ML workspace (can be monitored from the Portal)
  5. Saves the models to the outputs folder (the folder that persists after training in the Workspace)

Technical note on anchor boxes: in the config, filters = (classes + 5) × 3 for the 3 [convolutional] layers immediately before each [yolo] layer.
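
As a concrete illustration of both points (the training script does all of this automatically; the k-means helper below is a hedged sketch of the common IoU-based approach and may differ in detail from the repo's implementation):

# Filter count written into the config for a 2-class model
num_classes = 2
filters = (num_classes + 5) * 3   # -> 21 output filters per detection head

# Sketch: anchor sizing by k-means over labeled box widths/heights with an IoU "distance" (illustrative)
import numpy as np

def iou(boxes, clusters):
    # boxes: (n, 2) array of widths/heights; clusters: (k, 2) candidate anchors
    w = np.minimum(boxes[:, None, 0], clusters[None, :, 0])
    h = np.minimum(boxes[:, None, 1], clusters[None, :, 1])
    inter = w * h
    union = (boxes[:, 0] * boxes[:, 1])[:, None] + (clusters[:, 0] * clusters[:, 1])[None, :] - inter
    return inter / union

def kmeans_anchors(boxes, k, iters=100):
    boxes = np.asarray(boxes, dtype=float)
    clusters = boxes[np.random.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        nearest = np.argmax(iou(boxes, clusters), axis=1)   # assign each box to its best anchor
        for i in range(k):
            if np.any(nearest == i):
                clusters[i] = boxes[nearest == i].mean(axis=0)
    return clusters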

The driver script, azureml_driver.py, which wraps the training process, does the following.

  1. Creates or reinitializes the target compute
  2. Registers the Data Store (using the labeled data container and credentials found in local machine's environment variables) to Azure ML Workspace
  3. Defines the TensorFlow Estimator with some of the command line arguments to point to the training script
  4. Submits a Run under the Experiment name given in the command line arguments that runs the training script
  5. Registers the intermediate and final model to the Azure ML Workspace for later use (e.g. in deployments)
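
The list above maps roughly onto the following Azure ML SDK calls (a hedged sketch only; azureml_driver.py is the supported entry point, and names such as "gpu-cluster" and the VM size are illustrative assumptions).

# Sketch of the driver's core SDK calls (illustrative, not the repo's exact code)
import os
from azureml.core import Workspace, Experiment, Datastore
from azureml.core.compute import AmlCompute, ComputeTarget
from azureml.train.dnn import TensorFlow

ws = Workspace.from_config(path="project/.azureml/config.json")

# 1. Create (or reuse) the GPU compute target
compute_config = AmlCompute.provisioning_configuration(vm_size="Standard_NC6", max_nodes=1)
compute_target = ComputeTarget.create(ws, "gpu-cluster", compute_config)
compute_target.wait_for_completion(show_output=True)

# 2. Register the labeled-data container as a Datastore
datastore = Datastore.register_azure_blob_container(
    workspace=ws,
    datastore_name="trafficstore",
    container_name=os.environ["STORAGE_CONTAINER_NAME_TRAINDATA"],
    account_name=os.environ["STORAGE_ACCOUNT_NAME"],
    account_key=os.environ["STORAGE_ACCOUNT_KEY"])

# 3. Define the TensorFlow Estimator that points at the training script
estimator = TensorFlow(
    source_directory="project",
    compute_target=compute_target,
    entry_script="train_azureml.py",
    script_params={"--data-dir": "Traffic-PascalVOC-export", "--bs": 4, "--lr": 0.001},
    use_gpu=True)

# 4. Submit the Run under an Experiment
run = Experiment(ws, "carsv1").submit(estimator)
run.wait_for_completion(show_output=True)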

Ensure the Azure Storage environment variables are still set in the current terminal. See Upload images or video to Storage section to set these again if needed.

To train the model with Azure ML run the driver, as in the following example (all arguments are required).

python azureml_driver.py --experiment-name carsv1 --gpu-num 1 --class-path custom_classes.txt --data-dir Traffic-PascalVOC-export --num-clusters 6 --ds-name trafficstore --bs 4 --lr 0.001

For help on using this script (as with the other scripts), run with -h.

python azureml_driver.py -h

Download final model

Navigate to the Azure ML Workspace in the Azure Portal and go to the Experiment -> Run -> Outputs tab and download:

  • Model (trained_weights_final.h5)
  • custom_anchors.txt

Place these two files in the project folder so our inference script in the next step can find them.

Perform local inference on images or video

To understand how to perform inference with the script, change directory into the project folder and run: python yolo_video.py -h to see the help and argument options. You can perform inference on single images, live video feed or a recorded video. Output can be a labeled image, labeled streaming video or a labeled video output file.

Inference on an image

Change directory into project.

In addition to other arguments, use --image

Example: python yolo_video.py --model-path trained_weights_final.h5 --anchors-path custom_anchors.txt --classes-path custom_classes.txt --conf 0.5 --image

Inference on video from a webcam

Note: on Linux, video0 is usually the built-in camera (if one exists) and a USB camera will appear as video1, etc. (if there is no built-in camera, then video0 is usually the USB camera). On macOS, use --input 0 for the built-in camera and 1 for USB. This should be the same on Windows.

In addition to other arguments, use --input <video device id>

Example: python yolo_video.py --model-path trained_weights_final.h5 --anchors-path custom_anchors.txt --classes-path custom_classes.txt --conf 0.5 --input 0

Inference on video file and output to a video file

In addition to other arguments, use --input <video file name> --output xyz.mov

Example: python yolo_video.py --model-path trained_weights_final.h5 --anchors-path custom_anchors.txt --classes-path custom_classes.txt --conf 0.5 --input <path to video>/some_street_traffic.mov --output some_street_traffic_with_bboxes.mov
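
For context, video inference amounts to a frame-by-frame loop roughly like the hedged sketch below (yolo_video.py is the supported path; detect_image stands in for whatever per-frame detection call the model exposes).

# Sketch of a video inference loop with OpenCV (illustrative; detect_image is a hypothetical stand-in)
import cv2

def run_on_video(detect_image, input_path, output_path):
    cap = cv2.VideoCapture(input_path)              # or an integer device id for a webcam
    fps = cap.get(cv2.CAP_PROP_FPS) or 30
    width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    writer = cv2.VideoWriter(output_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (width, height))
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        writer.write(detect_image(frame))           # frame with bounding boxes drawn on it
    cap.release()
    writer.release()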

Deploying the solution locally and in the cloud

Steps are as follows. Ensure you are logged in to the Azure CLI with your Azure credentials and have selected the correct subscription. At any time, for help with a script, use: python <scriptname>.py --help.

  1. If the model is not yet registered to the workspace, download the model.

You will need to register it from your local environment to the Azure ML workspace. Use the following script, for example:

Example: python register_local_model_custom.py --model-local ep045-loss12.620.h5 --model-workspace carsv1-2class-tiny-yolov3.h5 --description "Tuned tiny YOLO v3 Keras model for car-truck 2-class object detection trained on Bing search images. From carsv1 experiment, Run 1."

Note: description must not exceed 200 characters and model name in Workspace (--model-workspace argument above) must not exceed 32 characters.

  2. Download custom_anchors.txt from the Outputs tab in the Run and place it in the deploy folder in this repo.
  3. Place a class labels file called custom_classes.txt in the deploy folder (this has the class names, one per line; order matters here).
  4. Change line 20 of score.py in the deploy folder to use the Workspace name of the model and the correct version number. (e.g. model_root = Model.get_model_path('carsv1-2class-tiny-yolov3.h5', version=1))

Note: it is a good idea to test a deployment locally first so we will do that, now.

  5. Perform a local test deployment (if you have Docker locally and are able to run it without elevated privileges). This will build a scoring/inferencing image (actually built and stored in the cloud, pulling in the cloud-registered model) and then run it as a container locally using your local Docker installation.

Example: python deploy_to_local.py --model-workspace carsv1-2class-tiny-yolov3.h5 --service-name cars-service-local

  • Record the location of the local webservice (you will append /score to make the scoring URL).
  6. Test the local deployment. Set an environment variable SCORING_URI to the scoring URL (webservice location with /score at the end) before running. Use --image to specify a test image to use.

Example: python test_service.py --image car_test1.jpg

  7. Perform a cloud test deployment with an Azure Container Instance.

Note: for production deployments in the cloud, it is recommended to use Azure Kubernetes Service for managed scale.

Example: python deploy_to_aci.py --service-name cars-service-aci --model-workspace carsv1-2class-tiny-yolov3.h5 --description "Cars ACI service - tiny Keras YOLO v3 model"

  8. Test the cloud deployment in the same way as the local deployment, but use the cloud scoring URI as the SCORING_URI and, since auth_enabled=True is set in the deployment configuration, also set a local environment variable WEBSERVICE_KEY. Get the scoring URI and key in the Azure Portal under the Azure ML Workspace and Deployments.

Example: python test_service.py --image car_test1.jpg
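
For reference, such an ACI deployment with auth enabled boils down to SDK calls roughly like the hedged sketch below (deploy_to_aci.py wraps this; the environment file deploy/requirements.txt and other names are illustrative assumptions).

# Sketch of an ACI deployment with the Azure ML SDK (illustrative, not the repo's exact code)
from azureml.core import Workspace
from azureml.core.environment import Environment
from azureml.core.model import InferenceConfig, Model
from azureml.core.webservice import AciWebservice

ws = Workspace.from_config(path="project/.azureml/config.json")
model = Model(ws, name="carsv1-2class-tiny-yolov3.h5")

env = Environment.from_pip_requirements("yolo-deploy-env", "deploy/requirements.txt")  # assumed file
inference_config = InferenceConfig(source_directory="deploy", entry_script="score.py", environment=env)
aci_config = AciWebservice.deploy_configuration(cpu_cores=1, memory_gb=4, auth_enabled=True)

service = Model.deploy(ws, "cars-service-aci", [model], inference_config, aci_config)
service.wait_for_deployment(show_output=True)

print(service.scoring_uri)       # value for SCORING_URI
print(service.get_keys()[0])     # value for WEBSERVICE_KEY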

Convert to ONNX format

For use in the ONNX Runtime, we can convert from Keras to ONNX format with the script deploy/convert_keras2onnx.py. As before, use --help to get the argument list.

Ensure you have the following libraries and versions:

tensorflow==1.15.2
keras==2.2.4
keras2onnx==1.7.0
onnx==1.6.0

This example converts a custom 2-class, tiny YOLO v3 model to ONNX format (change --num-clusters to 9 for full-sized YOLO v3 model).

Example: python convert_keras2onnx.py --model-local carsv1-2class-tiny-yolov3.h5 --classes-path custom_classes.txt --anchors-path custom_anchors.txt --name carsv1-2class-tiny-yolov3.onnx --num-clusters 6

Next, we can test the inferencing process with the ONNX Runtime (to ensure the conversion worked correctly) and benchmark seconds per inference event, as in the example below, where --image is your own image.

Example: python onnx_inference.py --model-local carsv1-2class-tiny-yolov3.onnx --anchors-path custom_anchors.txt --classes-path custom_classes.txt --conf 0.5 --image cars_and_trucks.jpg
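
For reference, the core conversion and a runtime sanity check look roughly like the hedged sketch below (the repo's scripts are the supported path; this assumes the .h5 file is a full saved Keras model, whereas a weights-only file would first need the model graph rebuilt).

# Sketch: Keras-to-ONNX conversion plus an ONNX Runtime load check (illustrative)
import keras2onnx
import onnxruntime as rt
from keras.models import load_model

keras_model = load_model("carsv1-2class-tiny-yolov3.h5", compile=False)
onnx_model = keras2onnx.convert_keras(keras_model, keras_model.name)
keras2onnx.save_model(onnx_model, "carsv1-2class-tiny-yolov3.onnx")

# Quick sanity check that the ONNX Runtime can load the converted model
session = rt.InferenceSession("carsv1-2class-tiny-yolov3.onnx")
print([inp.name for inp in session.get_inputs()])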

Kudos

Congratulations on successfully training and deploying your YOLO v3 model with Azure and Azure ML! You should pat yourself on the back for a job well done.

Credits

References


Troubleshooting

  1. Blob Storage upload issue:
  File "upload_to_blob.py", line 9, in <module>
    from azure.storage.blob import BlobServiceClient
ImportError: cannot import name 'BlobServiceClient'

Solve this by uninstalling and re-installing the Blob Storage Python package to a more recent version (you may be on a legacy version).

pip uninstall azure-storage-blob
pip install azure-storage-blob==12.3.1
  2. Driver script issue: utils.py", line 81, in kmeans assert False - this could mean:

    • There's an issue with your Datastore. Check the Datastore in the Portal under your Azure ML Workspace to make sure it's pointing to the correct Blob Storage account and container. Then, check the Blob Storage container to ensure it has the --data-dir that you specified when running the driver script (e.g. Traffic-PascalVOC-export) at the base level of the container. You may need to define environment variables for the driver script to locate these resources; see the Use driver Python script to train a model in the cloud section and its "IMPORTANT" note.
    • There are images in the Datastore that are not tagged. Go back to the VoTT tool, label the untagged images or set the project to export only tagged images (under Export Settings) and run the export again.
    • The folder structure for the labeled data in Azure Storage is not correct. The folder structure should be as follows (note VoTT sometimes adds a prefix such as "data%2F" to the xml and jpg file names).
    \your_container_name
      \Your_VoTT_Project_Name-PascalVOC-export
        \Annotations
          data_image01.xml
          data_image02.xml
          data_image03.xml
          ...
        \ImageSets
          \Main
            object1_train.txt
            object1_val.txt
            object2_train.txt
            object2_val.txt
            ...
        \JPEGImages
          data_image01.jpg
          data_image02.jpg
          data_image03.jpg
          ...
        pascal_label_map.pbtxt
    


  3. If there is a ResourceExhaustedError with a message such as the following in 70_driver_log.txt, then this means the GPU on the compute target ran out of GPU memory for the CUDA code/tensors.

"error": {
    "code": "UserError",
    "message": "User program failed with ResourceExhaustedError: OOM when allocating tensor with shape[8,38,38,512] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc\n\t [[{{node conv2d_66/convolution}}]]\nHint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.\n\n\t [[{{node loss/add_74}}]]\nHint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.\n",

Solve this by decreasing the batch size when running the driver script, azureml_driver.py; that is, set the batch size argument --bs to a smaller value (2 instead of 4, for instance).

  4. If, when running azureml_driver.py, registering the datastore (line 69) results in azureml._restclient.models.error_response.ErrorResponseException: (StorageAuthError) Forbidden. Please make sure you have permission to access the resource., then follow the instructions at Regenerate storage account access keys to re-register the datastore with updated Storage keys.

Extra notes

Example of recording video on macOS for testing, from the command line with the ffmpeg program (it needs to be installed first):

ffmpeg -f avfoundation -framerate 30 -i "0" -r 30 video.mpg

Here -r is the frame rate and -i is the camera input (0 is the built-in camera on macOS).

Some issues to know

  1. Default anchors can be used. If you use your own anchors, some changes are probably needed; note that the training script now calculates these automatically for you.

  2. The inference result is not totally the same as Darknet but the difference is small.

  3. The speed is slower than Darknet. Replacing PIL with OpenCV may help a little.

  4. Always load pretrained weights and freeze layers in the first stage of training. Or try Darknet training. It's OK if there is a mismatch warning.

  5. The training strategy is for reference only. Adjust it according to your dataset and your goal, and add further strategies if needed.
