# **Russian Open STT Dataset**

Arguably the largest public Russian STT dataset to date:
- - ~3m utterances;
- - 1,771+ hours;
- - 190GB;
- - Additional 3,000 hours ... and more ... to be released soon!;
+ - ~5m utterances;
+ - ~4,200 hours;
+ - 457GB;
+ - Additional 1,500 hours ... and more ... to be released soon!;
+ - And then maybe even more hours to be released!;


- Prove [me](https://t.me/snakers41) wrong!
+ Prove [us](https://t.me/snakers41) wrong!
Open issues, collaborate, submit a PR, contribute, share your datasets!
Let's make STT in Russian (and more) as open and available as CV models.


# **Dataset composition**

- | Dataset | Utterances | Hours | GB | Av len/chars | Comment | Annotation | Quality/noise |
- |---|---|---|---|---|---|---|---|
- | asr_public_phone_calls_2 (*) | | 1,500 | | | *Coming soon | | |
- | public_youtube1500 (*) | | 1,500 | | | *Coming soon | | |
- | tts_russian_addresses | 1,741,838 | 754 | 81 | 1.6s / 20 | Russian addresses | TTS, 4 voices | 100% / crisp |
- | public_youtube700 | 759,483 | 701 | 75 | 3.3s / 43 | Youtube videos | Subtitles | >95% / ~crisp |
- | asr_public_phone_calls_1 | 233,868 | 211 | 23 | 3.3s / 29 | Phone calls | ASR | 70% / noisy |
- | asr_public_stories_1 | 46,142 | 38 | 4 | 3.0s / 30 | Books | ASR | 70% / crisp |
- | public_series_1 | 20,243 | 17 | 2 | 3.1s / 38 | Youtube videos | Subtitles | 95% / ~crisp |
- | ru_RU | 5,826 | 17 | 2 | 10.8s / 12 | Public dataset | Alignment | 99% / crisp |
- | voxforge_ru | 8,344 | 17 | 2 | 7.5s / 77 | Public dataset | Reading | 100% / crisp |
- | russian_single | 3,357 | 9 | 1 | 9.3s / 102 | Public dataset | Alignment | 99% / crisp |
- | public_lecture_1 | 6,803 | 6 | 1 | 3.4s / 47 | Lectures | Subtitles | >95% / crisp |
- | Total | 2,825,904 | 1,771 | 190 | | | | |
+ | Dataset | Utterances | Hours | GB | Av s/chars | Comment | Annotation | Quality/noise |
+ |---|---|---|---|---|---|---|---|
+ | public_youtube1500 (*) | | 1,500 | | | *Coming soon | | |
+ | audiobook_2 | 1,149,404 | 1,511 | 166 | 4.7s / 56 | Books | Alignment | 99% / crisp |
+ | audiobook_1 | 196,666 | 237 | 26 | 4.3s / 50 | Books | Alignment | 99% / crisp |
+ | public_youtube700 | 759,483 | 701 | 75 | 3.3s / 43 | Youtube videos | Subtitles | 95% / ~crisp |
+ | tts_russian_addresses | 1,741,838 | 754 | 81 | 1.6s / 20 | Russian addresses | TTS 4 voices | 100% / crisp |
+ | asr_public_phone_calls_2 | 603,797 | 601 | 66 | 3.6s / 37 | Phone calls | ASR | 70% / noisy |
+ | asr_public_phone_calls_1 | 233,868 | 211 | 23 | 3.3s / 29 | Phone calls | ASR | 70% / noisy |
+ | asr_public_stories_2 | 78,186 | 78 | 9 | 3.5s / 43 | Books | ASR | 80% / crisp |
+ | asr_public_stories_1 | 46,142 | 38 | 4 | 3.0s / 30 | Books | ASR | 80% / crisp |
+ | public_series_1 | 20,243 | 17 | 2 | 3.1s / 38 | Youtube videos | Subtitles | 95% / ~crisp |
+ | ru_RU | 5,826 | 17 | 2 | 11s / 12 | Public dataset | Alignment | 99% / crisp |
+ | voxforge_ru | 8,344 | 17 | 2 | 7.5s / 77 | Public dataset | Reading | 100% / crisp |
+ | russian_single | 3,357 | 9 | 1 | 9.3s / 102 | Public dataset | Alignment | 99% / crisp |
+ | public_lecture_1 | 6,803 | 6 | 1 | 3.4s / 47 | Lectures | Subtitles | 95% / crisp |
+ | Total | 4,853,957 | 4,198 | 457 | | | | |


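As a quick sanity check of the updated composition table, the per-dataset rows can be summed and compared with the `Total` row. This is only a sketch: the numbers are copied by hand from the added (`+`) rows, `public_youtube1500` is left out because it has no utterance count yet, and the names `DATASETS` and `table_totals` are made up for illustration.

```python
# (utterances, hours) per dataset, copied from the updated composition table.
DATASETS = {
    "audiobook_2": (1_149_404, 1_511),
    "audiobook_1": (196_666, 237),
    "public_youtube700": (759_483, 701),
    "tts_russian_addresses": (1_741_838, 754),
    "asr_public_phone_calls_2": (603_797, 601),
    "asr_public_phone_calls_1": (233_868, 211),
    "asr_public_stories_2": (78_186, 78),
    "asr_public_stories_1": (46_142, 38),
    "public_series_1": (20_243, 17),
    "ru_RU": (5_826, 17),
    "voxforge_ru": (8_344, 17),
    "russian_single": (3_357, 9),
    "public_lecture_1": (6_803, 6),
}


def table_totals(rows):
    """Sum utterances and hours over all rows of the table."""
    utterances = sum(u for u, _ in rows.values())
    hours = sum(h for _, h in rows.values())
    return utterances, hours


utterances, hours = table_totals(DATASETS)
print(f"utterances: {utterances:,}")  # matches the Total row: 4,853,957
print(f"hours: {hours:,}")            # 4,197, one short of the reported 4,198
```

The utterance count matches the `Total` row exactly. The summed hours come out one hour lower than the reported 4,198, which is presumably a side effect of the per-row hours being rounded.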
# **Downloads**

## **Links**

- Meta data [file](https://ru-open-stt.ams3.digitaloceanspaces.com/public_meta_data_v02.csv).
+ Meta data [file](https://ru-open-stt.ams3.digitaloceanspaces.com/public_meta_data_v03.csv).
+

| Dataset | GB | GB, compressed | Audio | Source | Manifest |
|---|---|---|---|---|---|
+ | audiobook_1 | 26 | 20.8 | [part1](https://ru-open-stt.ams3.digitaloceanspaces.com/audiobooks_1.tar.gz) | Public books + alignment | [link](https://ru-open-stt.ams3.digitaloceanspaces.com/private_buriy_audiobooks_1.csv) |
+ | audiobook_2 | 166 | 131.7 | [part1](https://ru-open-stt.ams3.digitaloceanspaces.com/audiobooks_2.tar.gz_aa), [part2](https://ru-open-stt.ams3.digitaloceanspaces.com/audiobooks_2.tar.gz_ab), [part3](https://ru-open-stt.ams3.digitaloceanspaces.com/audiobooks_2.tar.gz_ac), [part4](https://ru-open-stt.ams3.digitaloceanspaces.com/audiobooks_2.tar.gz_ad), [part5](https://ru-open-stt.ams3.digitaloceanspaces.com/audiobooks_2.tar.gz_ae), [part6](https://ru-open-stt.ams3.digitaloceanspaces.com/audiobooks_2.tar.gz_af), [part7](https://ru-open-stt.ams3.digitaloceanspaces.com/audiobooks_2.tar.gz_ag) | Public books + alignment | [link](https://ru-open-stt.ams3.digitaloceanspaces.com/private_buriy_audiobooks_2.csv) |
+ | asr_public_phone_calls_2 | 66 | 51.7 | [part1](https://ru-open-stt.ams3.digitaloceanspaces.com/asr_public_phone_calls_2.tar.gz_aa), [part2](https://ru-open-stt.ams3.digitaloceanspaces.com/asr_public_phone_calls_2.tar.gz_ab), [part3](https://ru-open-stt.ams3.digitaloceanspaces.com/asr_public_phone_calls_2.tar.gz_ac) | ASR + public phone calls | [link](https://ru-open-stt.ams3.digitaloceanspaces.com/asr_public_phone_calls_2.csv) |
+ | asr_public_stories_2 | 9 | 7.5 | [part1](https://ru-open-stt.ams3.digitaloceanspaces.com/asr_public_stories_2.tar.gz) | Public books + alignment | [link](https://ru-open-stt.ams3.digitaloceanspaces.com/asr_public_stories_2.csv) |
| tts_russian_addresses_rhvoice_4voices | 80.9 | 67.0 | [part1](https://ru-open-stt.ams3.digitaloceanspaces.com/tts_russian_addresses_rhvoice_4voices.tar.gz_aa), [part2](https://ru-open-stt.ams3.digitaloceanspaces.com/tts_russian_addresses_rhvoice_4voices.tar.gz_ab), [part3](https://ru-open-stt.ams3.digitaloceanspaces.com/tts_russian_addresses_rhvoice_4voices.tar.gz_ac), [part4](https://ru-open-stt.ams3.digitaloceanspaces.com/tts_russian_addresses_rhvoice_4voices.tar.gz_ad) | TTS | [link](https://ru-open-stt.ams3.digitaloceanspaces.com/tts_russian_addresses_rhvoice_4voices.csv) |
| public_youtube700 | 75.0 | 67.0 | [part1](https://ru-open-stt.ams3.digitaloceanspaces.com/public_youtube700.tar.gz_aa), [part2](https://ru-open-stt.ams3.digitaloceanspaces.com/public_youtube700.tar.gz_ab), [part3](https://ru-open-stt.ams3.digitaloceanspaces.com/public_youtube700.tar.gz_ac), [part4](https://ru-open-stt.ams3.digitaloceanspaces.com/public_youtube700.tar.gz_ad) | YouTube videos | [link](https://ru-open-stt.ams3.digitaloceanspaces.com/public_youtube700.csv) |
| asr_public_phone_calls_1 | 22.7 | 19.0 | [part1](https://ru-open-stt.ams3.digitaloceanspaces.com/asr_public_phone_calls_1.tar.gz) | ASR + public phone calls | [link](https://ru-open-stt.ams3.digitaloceanspaces.com/asr_public_phone_calls_1.csv) |
@@ -71,6 +80,226 @@ Meta data [file](https://ru-open-stt.ams3.digitaloceanspaces.com/public_meta_dat
2. Download the meta data and manifests for each dataset:
3. Merge files (where applicable), unpack and enjoy!

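The merge-and-unpack step can be sketched in Python as follows. This assumes the `_aa`, `_ab`, ... parts are plain byte-wise splits of a single `.tar.gz` archive, so that concatenating them in order restores it; the README does not spell this out. The helper names `merge_parts` and `unpack` are made up for illustration.

```python
import shutil
import tarfile
from pathlib import Path


def merge_parts(part_paths, merged_path):
    """Concatenate split archive parts, in the given order, into one file."""
    with open(merged_path, "wb") as merged:
        for part in part_paths:
            with open(part, "rb") as chunk:
                shutil.copyfileobj(chunk, merged)
    return Path(merged_path)


def unpack(archive_path, target_dir):
    """Extract a gzip-compressed tar archive into target_dir."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(target)
    return target
```

For a multi-part dataset such as `public_youtube700`, one would pass the downloaded parts in alphabetical order (`sorted(Path(".").glob("public_youtube700.tar.gz_*"))`) to `merge_parts` and then call `unpack` on the result. Single-file archives can go straight to `unpack`.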
+ ## **Check md5sum**
+
+ `md5sum /path/to/downloaded/file`
+
+ <details>
+ <summary>Click to expand</summary>
+ <table>
+ <tr>
+ <th>type</th>
+ <th>md5sum</th>
+ <th>file</th>
+ </tr>
+ <tr>
+ <td>manifest</td>
+ <td>b0ce7564ba90b121aeb13aada73a6e30</td>
+ <td>asr_public_phone_calls_1.csv</td>
+ </tr>
+ <tr>
+ <td>manifest</td>
+ <td>6867d14dfdec1f9e9b8ca2f1de9ceda6</td>
+ <td>asr_public_phone_calls_2.csv</td>
+ </tr>
+ <tr>
+ <td>manifest</td>
+ <td>0bdd77e15172e654d9a1999a86e92c7f</td>
+ <td>asr_public_stories_1.csv</td>
+ </tr>
+ <tr>
+ <td>manifest</td>
+ <td>f388013039d94dc36970547944db51c7</td>
+ <td>asr_public_stories_2.csv</td>
+ </tr>
+ <tr>
+ <td>manifest</td>
+ <td>697738331b6021890c29a0d415d0f22d</td>
+ <td>private_buriy_audiobooks_1.csv</td>
+ </tr>
+ <tr>
+ <td>manifest</td>
+ <td>3b67e27c1429593cccbf7c516c4b582d</td>
+ <td>private_buriy_audiobooks_2.csv</td>
+ </tr>
+ <tr>
+ <td>manifest</td>
+ <td>04027c20eb3aff05f6067957ecff856b</td>
+ <td>public_lecture_1.csv</td>
+ </tr>
+ <tr>
+ <td>manifest</td>
+ <td>89da3f1b6afcd4d4936662ceabf3033e</td>
+ <td>public_series_1.csv</td>
+ </tr>
+ <tr>
+ <td>manifest</td>
+ <td>a81dfb018c88d0ecd5194ab3d8ff6c95</td>
+ <td>public_youtube700.csv</td>
+ </tr>
+ <tr>
+ <td>manifest</td>
+ <td>c858f020729c34ba0ab525bbb8950d0c</td>
+ <td>ru_RU.csv</td>
+ </tr>
+ <tr>
+ <td>manifest</td>
+ <td>0275525914825dec663fd53390fdc9a0</td>
+ <td>russian_single.csv</td>
+ </tr>
+ <tr>
+ <td>manifest</td>
+ <td>52f406f4e30fcc8c634f992befd91beb</td>
+ <td>tts_russian_addresses_rhvoice_4voices.csv</td>
+ </tr>
+ <tr>
+ <td>audio</td>
+ <td>a5496898ee78654bf398ec6df71540d7</td>
+ <td>asr_public_phone_calls_1.tar.gz</td>
+ </tr>
+ <tr>
+ <td>audio</td>
+ <td>e4df5ef50787384648b59f5a87edc0c6</td>
+ <td>asr_public_phone_calls_2.tar.gz</td>
+ </tr>
+ <tr>
+ <td>audio</td>
+ <td>97594127a922df8a7bcc2eecd2470805</td>
+ <td>asr_public_phone_calls_2.tar.gz_aa</td>
+ </tr>
+ <tr>
+ <td>audio</td>
+ <td>f9b6475f0f2898b16d9e6e0e648fb531</td>
+ <td>asr_public_phone_calls_2.tar.gz_ab</td>
+ </tr>
+ <tr>
+ <td>audio</td>
+ <td>b19977c889cda639f621195251e6bb6f</td>
+ <td>asr_public_phone_calls_2.tar.gz_ac</td>
+ </tr>
+ <tr>
+ <td>audio</td>
+ <td>657a31b544b10295f909ef4b2ca5c156</td>
+ <td>asr_public_stories_1.tar.gz</td>
+ </tr>
+ <tr>
+ <td>audio</td>
+ <td>7533581bb26975212817bcacb25546d0</td>
+ <td>asr_public_stories_2.tar.gz</td>
+ </tr>
+ <tr>
+ <td>audio</td>
+ <td>d7d374025c56ca556d9cde86b9fdffda</td>
+ <td>audiobooks_1.tar.gz</td>
+ </tr>
+ <tr>
+ <td>audio</td>
+ <td>3955616cd89761bf2d54d0e992f7eae5</td>
+ <td>audiobooks_2.tar.gz_aa</td>
+ </tr>
+ <tr>
+ <td>audio</td>
+ <td>81b6ec147c0c43bdd56002c41e0288b8</td>
+ <td>audiobooks_2.tar.gz_ab</td>
+ </tr>
+ <tr>
+ <td>audio</td>
+ <td>15d4cf99171c2db3f375619f4bd2b6d9</td>
+ <td>audiobooks_2.tar.gz_ac</td>
+ </tr>
+ <tr>
+ <td>audio</td>
+ <td>50635b0f4bdf44fae96e5a65f4738e19</td>
+ <td>audiobooks_2.tar.gz_ad</td>
+ </tr>
+ <tr>
+ <td>audio</td>
+ <td>f1103be39ffc2da4a98d8f6ddeb50aa0</td>
+ <td>audiobooks_2.tar.gz_ae</td>
+ </tr>
+ <tr>
+ <td>audio</td>
+ <td>8b45d2bd8b1fa1d906e36b9fabd9fe4c</td>
+ <td>audiobooks_2.tar.gz_af</td>
+ </tr>
+ <tr>
+ <td>audio</td>
+ <td>5104df44933b612b3c1bfc06f6376654</td>
+ <td>audiobooks_2.tar.gz_ag</td>
+ </tr>
+ <tr>
+ <td>audio</td>
+ <td>e6b9e5f46811d33ea34ce50f6067a762</td>
+ <td>public_lecture_1.tar.gz</td>
+ </tr>
+ <tr>
+ <td>audio</td>
+ <td>86ebf7e30986b8ee8df11f85b35588a0</td>
+ <td>public_series_1.tar.gz</td>
+ </tr>
+ <tr>
+ <td>audio</td>
+ <td>dc260dd8151b4fce6cde6d80af13146d</td>
+ <td>public_youtube700.tar.gz_aa</td>
+ </tr>
+ <tr>
+ <td>audio</td>
+ <td>04706ef0f98841ec8d2f20a83aca3cf1</td>
+ <td>public_youtube700.tar.gz_ab</td>
+ </tr>
+ <tr>
+ <td>audio</td>
+ <td>e11d5b118bf71425e4915e61277a06a9</td>
+ <td>public_youtube700.tar.gz_ac</td>
+ </tr>
+ <tr>
+ <td>audio</td>
+ <td>d9a93157263eb9d8078c0e0b88c271de</td>
+ <td>public_youtube700.tar.gz_ad</td>
+ </tr>
+ <tr>
+ <td>audio</td>
+ <td>1bbba5eb2f4911c9ed20ec69cbd292cb</td>
+ <td>ru_ru.tar.gz</td>
+ </tr>
+ <tr>
+ <td>audio</td>
+ <td>6f79a9c514ad48a5763e3142919fc765</td>
+ <td>russian_single.tar.gz</td>
+ </tr>
+ <tr>
+ <td>audio</td>
+ <td>c926df1068218eb9cc8103c94003fcc6</td>
+ <td>tts_russian_addresses_rhvoice_4voices.tar</td>
+ </tr>
+ <tr>
+ <td>audio</td>
+ <td>31d515e0bdfc467c3fe63088b817c15c</td>
+ <td>tts_russian_addresses_rhvoice_4voices.tar.gz_aa</td>
+ </tr>
+ <tr>
+ <td>audio</td>
+ <td>4ca15694a8d8a638bbdc5e90832eadb4</td>
+ <td>tts_russian_addresses_rhvoice_4voices.tar.gz_ab</td>
+ </tr>
+ <tr>
+ <td>audio</td>
+ <td>447559a38cd8bf61c5de64e602f06da3</td>
+ <td>tts_russian_addresses_rhvoice_4voices.tar.gz_ac</td>
+ </tr>
+ <tr>
+ <td>audio</td>
+ <td>9131347a97c2e794d7c6d5a265083e83</td>
+ <td>tts_russian_addresses_rhvoice_4voices.tar.gz_ad</td>
+ </tr>
+ <tr>
+ <td>audio</td>
+ <td>91e2115b17b1ad08649f428d2caa643b</td>
+ <td>voxforge_ru.tar.gz</td>
+ </tr>
+ </table>
+ </details>
+
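Where the `md5sum` command is not available, the same check can be done from Python. The sketch below uses only the standard `hashlib` module and reads the file in chunks so that multi-gigabyte archives do not have to fit in memory. The helper names `md5sum` and `matches` are made up for illustration and are not part of the repository.

```python
import hashlib


def md5sum(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it chunk by chunk."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected_md5):
    """Compare a file's digest with a checksum copied from the table."""
    return md5sum(path) == expected_md5.strip().lower()
```

For example, `matches("voxforge_ru.tar.gz", "91e2115b17b1ad08649f428d2caa643b")` should return `True` for an intact download of that archive, going by the table above.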
# **Annotation methodology**

The dataset is compiled using open domain sources.
@@ -105,43 +334,55 @@ store_path = Path(root_folder,

Use helper functions from here for easier work with manifest files.

- Read manifests:
- ```
+ #### **Read manifests**
+ <details><summary>See example</summary>
+ <p>
+
+ ```python
from utils.open_stt_utils import read_manifest

manifest_df = read_manifest('path/to/manifest.csv')
```

- Merge, check and save manifests:
- ```
+ </p>
+ </details>
+
+ #### **Merge, check and save manifests**
+ <details><summary>See example</summary>
+ <p>
+
+ ```python
from utils.open_stt_utils import (plain_merge_manifests,
                                  check_files,
                                  save_manifest)
-
train_manifests = [
-     'path/to/manifest1.csv',
-     'path/to/manifest2.csv',
+     'path/to/manifest1.csv',
+     'path/to/manifest2.csv',
]
-
- train_manifest = plain_merge_manifests(train_manifests,
+ train_manifest = plain_merge_manifests(train_manifests,
                                        MIN_DURATION=0.1,
                                        MAX_DURATION=100)
check_files(train_manifest)
-
save_manifest(train_manifest,
              'my_manifest.csv')
```

+ </p>
+ </details>
+
# **Contacts**

- Please contact me [here](https://t.me/snakers41) or just create a GitHub issue!
+ Please contact us [here](https://t.me/snakers41) or just create a GitHub issue!

# **FAQ**

## **1. Issues with reading files**

- Maybe try this approach:
- ```
+ #### **Maybe try this approach:**
+ <details><summary>See example</summary>
+ <p>
+
+ ```python
from scipy.io import wavfile

sample_rate, sound = wavfile.read(path)
@@ -151,6 +392,10 @@ sound = sound.astype('float32')
if abs_max > 0:
    sound *= 1 / abs_max
```
+
+ </p>
+ </details>
+
## **2. Why share such a dataset?**

We are not altruists, life just is **not a zero-sum game**.
@@ -163,3 +408,11 @@ Consider the progress in computer vision, that was made possible by:

STT does not enjoy the same attention from the ML community because it is data hungry and public datasets are lacking, especially for languages other than English.
Ultimately this leaves the general community worse off.
+
+ ## **3. Known issues with the dataset to be fixed**
+ - Blank files in the YouTube dataset. Just filter them out using the meta data. Will be fixed in the future;
+ - Some files have low values / crash with torchaudio;
+ - It looks like scipy does not always write meta data when saving wavs (or you should save an (N,1)-shaped file); this can be fixed as shown above;
+
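The README suggests filtering blank files via the meta data. As an alternative, a file can also be inspected directly. The sketch below is one possible way to do it, using only the standard library and assuming the audio is 16-bit PCM wav, which is an assumption and not something the README states. The names `peak_amplitude`, `is_blank` and the `threshold` parameter are hypothetical.

```python
import array
import sys
import wave


def peak_amplitude(path):
    """Return the largest absolute sample value of a 16-bit PCM wav file."""
    with wave.open(str(path), "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM wav files")
        frames = wav.readframes(wav.getnframes())
    samples = array.array("h")
    samples.frombytes(frames)
    if sys.byteorder == "big":
        samples.byteswap()  # wav sample data is stored little-endian
    return max((abs(s) for s in samples), default=0)


def is_blank(path, threshold=0):
    """Treat a file as blank when no sample exceeds the threshold."""
    return peak_amplitude(path) <= threshold
```

A small positive `threshold` would also catch the "low values" files mentioned above, although a suitable value would have to be found empirically.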
+ # **License**
+ Dual license: cc-by-nc, with commercial usage available after agreement with the dataset authors