README.md: 24 additions & 20 deletions
@@ -20,22 +20,22 @@ The solution supports two model configurations:
where:
* **Realtime streams (RTS)** is the number of concurrent streams that can be serviced by a single accelerator
- * **p99 latency** is the 99th-percentile latency to process a single 60 ms audio frame and return any predictions. Note that latency increases with more concurrent streams.
+ * **p99 latency** is the 99th-percentile latency to process a single 60 ms audio frame and return any predictions. Note that latency increases with the number of concurrent streams.
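The p99 figure above is a percentile over per-frame processing latencies, measured separately for each concurrent-stream count. A minimal sketch of that computation follows; the latency arrays are synthetic stand-ins and the harness is assumed, not this repository's benchmarking code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for measured per-frame processing latencies (ms),
# one array per number of concurrent realtime streams (RTS) being served.
# Values are drawn so latency grows with stream count, as noted above.
latencies_ms_by_rts = {
    rts: rng.gamma(shape=5.0, scale=0.2 * rts / 500, size=100_000)
    for rts in (500, 1000, 2000)
}

for rts, latencies_ms in latencies_ms_by_rts.items():
    # p99 latency: 99% of 60 ms frames are turned around at least this fast.
    p99_ms = np.percentile(latencies_ms, 99)
    print(f"{rts:>5} streams -> p99 frame latency ~= {p99_ms:.2f} ms")
```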
- <sup>§</sup>The `large` model inference performance figures are provisional.
- The **solution scales linearly with number of accelerators in the server** (tested up to 8000 RTS per server).
+ The **solution scales linearly up to 8 accelerators**, and we have measured a single server supporting **16,000 RTS** with the `base` model.
The `base` and `large` configurations are optimised for inference on FPGA with Myrtle's IP to achieve high utilisation of the available resources. They were chosen after hyperparameter searches on 10k-50k hrs of training data.
+ <sup>§</sup>The `large` model inference performance figures are provisional.
### Word Error Rates (WERs)
When training on the 50k hrs of open-source data described below, the solution has the following WERs:
These WERs are for streaming scenarios without additional forward context. Both configurations have a frame size of 60 ms, so, for a given segment of audio, the model sees between 0 and 60 ms of future context before making predictions.
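To make the 0-60 ms of future context concrete: a prediction covering any point in a frame is only made once that whole 60 ms frame has arrived, so the lookahead is simply the distance to the end of the enclosing frame. A small sketch of that arithmetic (the helper below is illustrative, not part of the training or inference code):

```python
import math

FRAME_MS = 60  # both configurations emit predictions once per 60 ms frame

def future_context_ms(t_ms: float, frame_ms: float = FRAME_MS) -> float:
    """Future audio (in ms) the model has seen by the time it first makes a
    prediction covering time t_ms, i.e. the gap to the end of t_ms's frame."""
    frame_end = math.floor(t_ms / frame_ms) * frame_ms + frame_ms
    return frame_end - t_ms

print(future_context_ms(120.0))   # 60.0 -> audio at the start of a frame
print(future_context_ms(150.0))   # 30.0 -> audio mid-frame
print(future_context_ms(179.99))  # ~0.0 -> audio just before the frame ends
```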
@@ -48,28 +48,32 @@ The 50k hrs of training data is a mixture of the following open-source datasets:
This data has a `maximum_duration` of 20s and a mean length of 12.75s.
- **<sup>*</sup>** None of these training data subsets include near-field unscripted utterances nor financial terminology. As such the Earnings21 benchmark is out-of-domain for these systems.
+ <sup>*</sup>None of these training data subsets include near-field unscripted utterances or financial terminology. As such, the Earnings21 benchmark is out-of-domain for these systems.
+ <sup>†</sup>`base` model WERs were not updated for the latest release. The provided values are from version [v1.6.1](https://github.com/MyrtleSoftware/myrtle-rnnt/releases/tag/v1.6.0).
### Training times <a name="train-timings"></a>
Training throughputs on an `8 x A100 (80GB)` system are as follows:
- | Model | Training time | Throughput | No. of updates | per-gpu `batch_size` | `GRAD_ACCUMULATION_BATCHES` |
* **Throughput** is the number of utterances seen per second during training (higher is better)
- * **No. of updates** is the number of optimiser steps at `GLOBAL_BATCH_SIZE=1024` that are required to train the models on the 50k hrs training dataset. You may need fewer steps when training with less data
- * **`GRAD_ACCUMULATION_BATCHES`** is the number of gradient accumulation steps per gpu required to achieve the `GLOBAL_BATCH_SIZE` of 1024. For all configurations the **per-gpu `batch_size`** is as large as possible meaning that `GRAD_ACCUMULATION_BATCHES` is set as small as possible.
+ * **No. of updates** is the number of optimiser steps at `--global_batch_size=1024` required to train the models on the 50k hrs training dataset. You may need fewer steps when training with less data.
+ * **`grad_accumulation_batches`** is the number of gradient accumulation steps performed on each GPU before taking an optimiser step.
+ * **`batch_split_factor`** is the number of sub-batches that the `PER_GPU_BATCH_SIZE` is split into before these sub-batches are passed through the joint network and loss.
+ For more details on these hyper-parameters, including how to set them, please refer to the [batch size arguments](training/docs/batch_size_hyperparameters.md) documentation.
- For more details on the batch size hyperparameters refer to the [Training Commands subsection of training/README.md](training/README.md#training). To get started with training see the [training/README.md](training/README.md).
+ To get started with training, see [training/README.md](training/README.md).
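The batch-size arguments above fit together as simple arithmetic. Under the usual data-parallel assumption (global batch = per-GPU batch × number of GPUs × gradient-accumulation steps, as the removed `GRAD_ACCUMULATION_BATCHES` bullet spelled out), and with placeholder values rather than the training script's real defaults, a sketch looks like this:

```python
# Illustrative arithmetic only -- assumed relationships and placeholder
# values, not code or defaults from this repository.
num_gpus = 8              # e.g. the 8 x A100 (80GB) system above
global_batch_size = 1024  # --global_batch_size=1024
per_gpu_batch_size = 32   # hypothetical value
batch_split_factor = 4    # hypothetical value

# grad_accumulation_batches: per-GPU batches accumulated before each
# optimiser step, so the effective batch reaches global_batch_size.
grad_accumulation_batches = global_batch_size // (per_gpu_batch_size * num_gpus)
assert per_gpu_batch_size * num_gpus * grad_accumulation_batches == global_batch_size

# batch_split_factor splits each per-GPU batch into smaller sub-batches for
# the joint network and loss (a memory saving; the effective batch size and
# the number of optimiser steps are unchanged).
joint_sub_batch_size = per_gpu_batch_size // batch_split_factor

# Rough optimiser updates per pass over ~50k hrs of audio with a mean
# utterance length of 12.75 s (figures quoted in the README above).
total_utterances = int(50_000 * 3600 / 12.75)
updates_per_epoch = total_utterances // global_batch_size

print(grad_accumulation_batches)  # 4
print(joint_sub_batch_size)       # 8
print(updates_per_epoch)          # roughly 13.8k with these figures
```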