This repository was archived by the owner on Oct 10, 2022. It is now read-only.

Commit dd6ac59

Add v0.3-alpha description
1 parent 6ed89f9 commit dd6ac59

1 file changed: +286, -33 lines changed

README.md

Lines changed: 286 additions & 33 deletions
@@ -1,42 +1,51 @@
11
# **Russian Open STT Dataset**
22

33
Arguably the largest public Russian STT dataset to date:
4-
- ~3m utterances;
5-
- 1,771+ hours;
6-
- 190GB;
7-
- Additional 3,000 hours ... and more ... to be released soon!;
4+
- ~5m utterances;
5+
- ~4,200 hours;
6+
- 457GB;
7+
- Additional 1,500 hours ... and more ... to be released soon!;
8+
- And then maybe even more hours to be released!;
89

910

10-
Prove [me](https://t.me/snakers41) wrong!
11+
Prove [us](https://t.me/snakers41) wrong!
1112
Open issues, collaborate, submit a PR, contribute, share your datasets!
1213
Let's make STT in Russian (and more) as open and available as CV models.
1314

1415

1516
# **Dataset composition**
1617

17-
| Dataset | Utterances | Hours | GB | Av len/chars | Comment | Annotation | Quality/noise |
18-
|-------------------------------|------------|-------|-----|--------------|------------------|---------------|---------------|
19-
| asr_public_phone_calls_2 (*) | | 1,500 | | | * Coming soon | | |
20-
| public_youtube1500 (*) | | 1,500 | | | * Coming soon | | |
21-
| tts_russian_addresses | 1,741,838 | 754 | 81 | 1.6s / 20 | Russian addresses| TTS, 4 voices | 100% / crisp |
22-
| public_youtube700 | 759,483 | 701 | 75 | 3.3s / 43 | Youtube videos | Subtitles | >95% / ~crisp |
23-
| asr_public_phone_calls_1 | 233,868 | 211 | 23 | 3.3s / 29 | Phone calls | ASR | 70% / noisy |
24-
| asr_public_stories_1 | 46,142 | 38 | 4 | 3.0s / 30 | Books | ASR | 70% / crisp |
25-
| public_series_1 | 20,243 | 17 | 2 | 3.1s / 38 | Youtube videos | Subtitles | 95% / ~crisp |
26-
| ru_RU | 5,826 | 17 | 2 | 10.8s / 12 | Public dataset | Alignment | 99% / crisp |
27-
| voxforge_ru | 8,344 | 17 | 2 | 7.5s / 77 | Public dataset | Reading | 100% / crisp |
28-
| russian_single | 3,357 | 9 | 1 | 9.3s / 102 | Public dataset | Alignment | 99% / crisp |
29-
| public_lecture_1 | 6,803 | 6 | 1 | 3.4s / 47 | Lectures | Subtitles | >95% / crisp |
30-
| Total | 2,825,904 | 1,771 | 190 | | | | |
18+
| Dataset | Utterances | Hours | GB | Av s/chars | Comment | Annotation | Quality/noise |
19+
|---------------------------|------------|-------|-----|------------|------------------|-------------|---------------|
20+
| public_youtube1500 (*) | | 1,500 | | | * Coming soon | | |
21+
| audiobook_2 | 1,149,404 | 1,511 | 166 | 4.7s / 56 | Books | Alignment | 99% / crisp |
22+
| audiobook_1 | 196,666 | 237 | 26 | 4.3s / 50 | Books | Alignment | 99% / crisp |
23+
| public_youtube700 | 759,483 | 701 | 75 | 3.3s / 43 | Youtube videos | Subtitles | 95% / ~crisp |
24+
| tts_russian_addresses | 1,741,838 | 754 | 81 | 1.6s / 20 | Russian addresses| TTS 4 voices| 100% / crisp |
25+
| asr_public_phone_calls_2 | 603,797 | 601 | 66 | 3.6s / 37 | Phone calls | ASR | 70% / noisy |
26+
| asr_public_phone_calls_1 | 233,868 | 211 | 23 | 3.3s / 29 | Phone calls | ASR | 70% / noisy |
27+
| asr_public_stories_2 | 78,186 | 78 | 9 | 3.5s / 43 | Books | ASR | 80% / crisp |
28+
| asr_public_stories_1 | 46,142 | 38 | 4 | 3.0s / 30 | Books | ASR | 80% / crisp |
29+
| public_series_1 | 20,243 | 17 | 2 | 3.1s / 38 | Youtube videos | Subtitles | 95% / ~crisp |
30+
| ru_RU | 5,826 | 17 | 2 | 11s / 12 | Public dataset | Alignment | 99% / crisp |
31+
| voxforge_ru | 8,344 | 17 | 2 | 7.5s / 77 | Public dataset | Reading | 100% / crisp |
32+
| russian_single | 3,357 | 9 | 1 | 9.3s / 102 | Public dataset | Alignment | 99% / crisp |
33+
| public_lecture_1 | 6,803 | 6 | 1 | 3.4s / 47 | Lectures | Subtitles | 95% / crisp |
34+
| Total | 4,853,957 | 4,198 | 457 | | | | |
3135

3236
# **Downloads**
3337

3438
## **Links**
3539

36-
Meta data [file](https://ru-open-stt.ams3.digitaloceanspaces.com/public_meta_data_v02.csv).
40+
Meta data [file](https://ru-open-stt.ams3.digitaloceanspaces.com/public_meta_data_v03.csv).
41+
3742

3843
| Dataset | GB | GB, compressed | Audio | Source | Manifest |
3944
|---------------------------------------|------|----------------|-------| -------| ----------|
45+
| audiobook_1 | 26 | 20.8 | [part1](https://ru-open-stt.ams3.digitaloceanspaces.com/audiobooks_1.tar.gz) | Public books + alignment | [link](https://ru-open-stt.ams3.digitaloceanspaces.com/private_buriy_audiobooks_1.csv) |
46+
| audiobook_2 | 166 | 131.7 | [part1](https://ru-open-stt.ams3.digitaloceanspaces.com/audiobooks_2.tar.gz_aa), [part2](https://ru-open-stt.ams3.digitaloceanspaces.com/audiobooks_2.tar.gz_ab), [part3](https://ru-open-stt.ams3.digitaloceanspaces.com/audiobooks_2.tar.gz_ac), [part4](https://ru-open-stt.ams3.digitaloceanspaces.com/audiobooks_2.tar.gz_ad), [part5](https://ru-open-stt.ams3.digitaloceanspaces.com/audiobooks_2.tar.gz_ae), [part6](https://ru-open-stt.ams3.digitaloceanspaces.com/audiobooks_2.tar.gz_af), [part7](https://ru-open-stt.ams3.digitaloceanspaces.com/audiobooks_2.tar.gz_ag) | Public books + alignment | [link](https://ru-open-stt.ams3.digitaloceanspaces.com/private_buriy_audiobooks_2.csv) |
47+
| asr_public_phone_calls_2 | 66 | 51.7 | [part1](https://ru-open-stt.ams3.digitaloceanspaces.com/asr_public_phone_calls_2.tar.gz_aa), [part2](https://ru-open-stt.ams3.digitaloceanspaces.com/asr_public_phone_calls_2.tar.gz_ab), [part3](https://ru-open-stt.ams3.digitaloceanspaces.com/asr_public_phone_calls_2.tar.gz_ac) | ASR + public phone calls | [link](https://ru-open-stt.ams3.digitaloceanspaces.com/asr_public_phone_calls_2.csv) |
48+
| asr_public_stories_2 | 9 | 7.5 | [part1](https://ru-open-stt.ams3.digitaloceanspaces.com/asr_public_stories_2.tar.gz) | Public books + alignment | [link](https://ru-open-stt.ams3.digitaloceanspaces.com/asr_public_stories_2.csv) |
4049
| tts_russian_addresses_rhvoice_4voices | 80.9 | 67.0 | [part1](https://ru-open-stt.ams3.digitaloceanspaces.com/tts_russian_addresses_rhvoice_4voices.tar.gz_aa), [part2](https://ru-open-stt.ams3.digitaloceanspaces.com/tts_russian_addresses_rhvoice_4voices.tar.gz_ab), [part3](https://ru-open-stt.ams3.digitaloceanspaces.com/tts_russian_addresses_rhvoice_4voices.tar.gz_ac), [part4](https://ru-open-stt.ams3.digitaloceanspaces.com/tts_russian_addresses_rhvoice_4voices.tar.gz_ad) | TTS | [link](https://ru-open-stt.ams3.digitaloceanspaces.com/tts_russian_addresses_rhvoice_4voices.csv) |
4150
| public_youtube700 | 75.0 | 67.0 | [part1](https://ru-open-stt.ams3.digitaloceanspaces.com/public_youtube700.tar.gz_aa), [part2](https://ru-open-stt.ams3.digitaloceanspaces.com/public_youtube700.tar.gz_ab), [part3](https://ru-open-stt.ams3.digitaloceanspaces.com/public_youtube700.tar.gz_ac), [part4](https://ru-open-stt.ams3.digitaloceanspaces.com/public_youtube700.tar.gz_ad) | YouTube videos | [link](https://ru-open-stt.ams3.digitaloceanspaces.com/public_youtube700.csv) |
4251
| asr_public_phone_calls_1 | 22.7 | 19.0 | [part1](https://ru-open-stt.ams3.digitaloceanspaces.com/asr_public_phone_calls_1.tar.gz) | ASR + public phone calls | [link](https://ru-open-stt.ams3.digitaloceanspaces.com/asr_public_phone_calls_1.csv) |
@@ -71,6 +80,226 @@ Meta data [file](https://ru-open-stt.ams3.digitaloceanspaces.com/public_meta_dat
7180
2. Download the meta data and manifests for each dataset:
7281
3. Merge files (where applicable), unpack and enjoy!
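
Step 3 above can be sketched in Python as follows. This is a minimal sketch, not the authors' own script: the helper names `merge_parts` and `unpack` are hypothetical, and it assumes the parts were produced by a plain byte-wise split, so concatenating them in suffix order (`_aa`, `_ab`, ...) restores the original archive.

```python
import shutil
import tarfile
from pathlib import Path


def merge_parts(parts, merged_path):
    """Concatenate split archive parts, in the given order, into one file."""
    merged_path = Path(merged_path)
    with merged_path.open('wb') as merged:
        for part in parts:
            with Path(part).open('rb') as chunk:
                # Stream block by block so multi-GB parts never sit in memory.
                shutil.copyfileobj(chunk, merged)
    return merged_path


def unpack(archive_path, target_dir):
    """Extract a tar archive (compression is auto-detected) into target_dir."""
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, 'r:*') as archive:
        archive.extractall(target_dir)
    return target_dir
```

For example, `merge_parts(sorted(Path('downloads').glob('audiobooks_2.tar.gz_*')), 'audiobooks_2.tar.gz')` followed by `unpack('audiobooks_2.tar.gz', 'data')`; the `downloads` and `data` folder names are placeholders. Sorting works here because the part suffixes are in lexicographic order.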
7382

83+
## **Check md5sum**
84+
85+
`md5sum /path/to/downloaded/file`
86+
87+
<details>
88+
<summary>Click to expand</summary>
89+
<table>
90+
<tr>
91+
<th>type</th>
92+
<th>md5sum</th>
93+
<th>file</th>
94+
</tr>
95+
<tr>
96+
<td>manifest</td>
97+
<td>b0ce7564ba90b121aeb13aada73a6e30</td>
98+
<td>asr_public_phone_calls_1.csv</td>
99+
</tr>
100+
<tr>
101+
<td>manifest</td>
102+
<td>6867d14dfdec1f9e9b8ca2f1de9ceda6</td>
103+
<td>asr_public_phone_calls_2.csv</td>
104+
</tr>
105+
<tr>
106+
<td>manifest</td>
107+
<td>0bdd77e15172e654d9a1999a86e92c7f</td>
108+
<td>asr_public_stories_1.csv</td>
109+
</tr>
110+
<tr>
111+
<td>manifest</td>
112+
<td>f388013039d94dc36970547944db51c7</td>
113+
<td>asr_public_stories_2.csv</td>
114+
</tr>
115+
<tr>
116+
<td>manifest</td>
117+
<td>697738331b6021890c29a0d415d0f22d</td>
118+
<td>private_buriy_audiobooks_1.csv</td>
119+
</tr>
120+
<tr>
121+
<td>manifest</td>
122+
<td>3b67e27c1429593cccbf7c516c4b582d</td>
123+
<td>private_buriy_audiobooks_2.csv</td>
124+
</tr>
125+
<tr>
126+
<td>manifest</td>
127+
<td>04027c20eb3aff05f6067957ecff856b</td>
128+
<td>public_lecture_1.csv</td>
129+
</tr>
130+
<tr>
131+
<td>manifest</td>
132+
<td>89da3f1b6afcd4d4936662ceabf3033e</td>
133+
<td>public_series_1.csv</td>
134+
</tr>
135+
<tr>
136+
<td>manifest</td>
137+
<td>a81dfb018c88d0ecd5194ab3d8ff6c95</td>
138+
<td>public_youtube700.csv</td>
139+
</tr>
140+
<tr>
141+
<td>manifest</td>
142+
<td>c858f020729c34ba0ab525bbb8950d0c</td>
143+
<td>ru_RU.csv</td>
144+
</tr>
145+
<tr>
146+
<td>manifest</td>
147+
<td>0275525914825dec663fd53390fdc9a0</td>
148+
<td>russian_single.csv</td>
149+
</tr>
150+
<tr>
151+
<td>manifest</td>
152+
<td>52f406f4e30fcc8c634f992befd91beb</td>
153+
<td>tts_russian_addresses_rhvoice_4voices.csv</td>
154+
</tr>
155+
<tr>
156+
<td>audio</td>
157+
<td>a5496898ee78654bf398ec6df71540d7</td>
158+
<td>asr_public_phone_calls_1.tar.gz</td>
159+
</tr>
160+
<tr>
161+
<td>audio</td>
162+
<td>e4df5ef50787384648b59f5a87edc0c6</td>
163+
<td>asr_public_phone_calls_2.tar.gz</td>
164+
</tr>
165+
<tr>
166+
<td>audio</td>
167+
<td>97594127a922df8a7bcc2eecd2470805</td>
168+
<td>asr_public_phone_calls_2.tar.gz_aa</td>
169+
</tr>
170+
<tr>
171+
<td>audio</td>
172+
<td>f9b6475f0f2898b16d9e6e0e648fb531</td>
173+
<td>asr_public_phone_calls_2.tar.gz_ab</td>
174+
</tr>
175+
<tr>
176+
<td>audio</td>
177+
<td>b19977c889cda639f621195251e6bb6f</td>
178+
<td>asr_public_phone_calls_2.tar.gz_ac</td>
179+
</tr>
180+
<tr>
181+
<td>audio</td>
182+
<td>657a31b544b10295f909ef4b2ca5c156</td>
183+
<td>asr_public_stories_1.tar.gz</td>
184+
</tr>
185+
<tr>
186+
<td>audio</td>
187+
<td>7533581bb26975212817bcacb25546d0</td>
188+
<td>asr_public_stories_2.tar.gz</td>
189+
</tr>
190+
<tr>
191+
<td>audio</td>
192+
<td>d7d374025c56ca556d9cde86b9fdffda</td>
193+
<td>audiobooks_1.tar.gz</td>
194+
</tr>
195+
<tr>
196+
<td>audio</td>
197+
<td>3955616cd89761bf2d54d0e992f7eae5</td>
198+
<td>audiobooks_2.tar.gz_aa</td>
199+
</tr>
200+
<tr>
201+
<td>audio</td>
202+
<td>81b6ec147c0c43bdd56002c41e0288b8</td>
203+
<td>audiobooks_2.tar.gz_ab</td>
204+
</tr>
205+
<tr>
206+
<td>audio</td>
207+
<td>15d4cf99171c2db3f375619f4bd2b6d9</td>
208+
<td>audiobooks_2.tar.gz_ac</td>
209+
</tr>
210+
<tr>
211+
<td>audio</td>
212+
<td>50635b0f4bdf44fae96e5a65f4738e19</td>
213+
<td>audiobooks_2.tar.gz_ad</td>
214+
</tr>
215+
<tr>
216+
<td>audio</td>
217+
<td>f1103be39ffc2da4a98d8f6ddeb50aa0</td>
218+
<td>audiobooks_2.tar.gz_ae</td>
219+
</tr>
220+
<tr>
221+
<td>audio</td>
222+
<td>8b45d2bd8b1fa1d906e36b9fabd9fe4c</td>
223+
<td>audiobooks_2.tar.gz_af</td>
224+
</tr>
225+
<tr>
226+
<td>audio</td>
227+
<td>5104df44933b612b3c1bfc06f6376654</td>
228+
<td>audiobooks_2.tar.gz_ag</td>
229+
</tr>
230+
<tr>
231+
<td>audio</td>
232+
<td>e6b9e5f46811d33ea34ce50f6067a762</td>
233+
<td>public_lecture_1.tar.gz</td>
234+
</tr>
235+
<tr>
236+
<td>audio</td>
237+
<td>86ebf7e30986b8ee8df11f85b35588a0</td>
238+
<td>public_series_1.tar.gz</td>
239+
</tr>
240+
<tr>
241+
<td>audio</td>
242+
<td>dc260dd8151b4fce6cde6d80af13146d</td>
243+
<td>public_youtube700.tar.gz_aa</td>
244+
</tr>
245+
<tr>
246+
<td>audio</td>
247+
<td>04706ef0f98841ec8d2f20a83aca3cf1</td>
248+
<td>public_youtube700.tar.gz_ab</td>
249+
</tr>
250+
<tr>
251+
<td>audio</td>
252+
<td>e11d5b118bf71425e4915e61277a06a9</td>
253+
<td>public_youtube700.tar.gz_ac</td>
254+
</tr>
255+
<tr>
256+
<td>audio</td>
257+
<td>d9a93157263eb9d8078c0e0b88c271de</td>
258+
<td>public_youtube700.tar.gz_ad</td>
259+
</tr>
260+
<tr>
261+
<td>audio</td>
262+
<td>1bbba5eb2f4911c9ed20ec69cbd292cb</td>
263+
<td>ru_ru.tar.gz</td>
264+
</tr>
265+
<tr>
266+
<td>audio</td>
267+
<td>6f79a9c514ad48a5763e3142919fc765</td>
268+
<td>russian_single.tar.gz</td>
269+
</tr>
270+
<tr>
271+
<td>audio</td>
272+
<td>c926df1068218eb9cc8103c94003fcc6</td>
273+
<td>tts_russian_addresses_rhvoice_4voices.tar</td>
274+
</tr>
275+
<tr>
276+
<td>audio</td>
277+
<td>31d515e0bdfc467c3fe63088b817c15c</td>
278+
<td>tts_russian_addresses_rhvoice_4voices.tar.gz_aa</td>
279+
</tr>
280+
<tr>
281+
<td>audio</td>
282+
<td>4ca15694a8d8a638bbdc5e90832eadb4</td>
283+
<td>tts_russian_addresses_rhvoice_4voices.tar.gz_ab</td>
284+
</tr>
285+
<tr>
286+
<td>audio</td>
287+
<td>447559a38cd8bf61c5de64e602f06da3</td>
288+
<td>tts_russian_addresses_rhvoice_4voices.tar.gz_ac</td>
289+
</tr>
290+
<tr>
291+
<td>audio</td>
292+
<td>9131347a97c2e794d7c6d5a265083e83</td>
293+
<td>tts_russian_addresses_rhvoice_4voices.tar.gz_ad</td>
294+
</tr>
295+
<tr>
296+
<td>audio</td>
297+
<td>91e2115b17b1ad08649f428d2caa643b</td>
298+
<td>voxforge_ru.tar.gz</td>
299+
</tr>
300+
</table>
301+
</details>
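
If `md5sum` is not available on your system, the same check can be sketched in Python with the standard `hashlib` module. The helper names `md5sum` and `verify` below are hypothetical and not part of this repository; the expected value is whatever the table above lists for the file.

```python
import hashlib


def md5sum(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB blocks."""
    digest = hashlib.md5()
    with open(path, 'rb') as handle:
        # iter() with a sentinel keeps reading until read() returns b''.
        for block in iter(lambda: handle.read(chunk_size), b''):
            digest.update(block)
    return digest.hexdigest()


def verify(path, expected_md5):
    """Compare a file's digest with the value listed in the table."""
    return md5sum(path) == expected_md5.strip().lower()
```

For example, `verify('public_lecture_1.tar.gz', 'e6b9e5f46811d33ea34ce50f6067a762')` should return `True` for an intact download of that archive.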
302+
74303
# **Annotation methodology**
75304

76305
The dataset is compiled using open domain sources.
@@ -105,43 +334,55 @@ store_path = Path(root_folder,
105334

106335
Use helper functions from here for easier work with manifest files.
107336

108-
Read manifests:
109-
```
337+
#### **Read manifests**
338+
<details><summary>See example</summary>
339+
<p>
340+
341+
```python
110342
from utils.open_stt_utils import read_manifest
111343

112344
manifest_df = read_manifest('path/to/manifest.csv')
113345
```
114346

115-
Merge, check and save manifests:
116-
```
347+
</p>
348+
</details>
349+
350+
#### **Merge, check and save manifests**
351+
<details><summary>See example</summary>
352+
<p>
353+
354+
```python
117355
from utils.open_stt_utils import (plain_merge_manifests,
118356
check_files,
119357
save_manifest)
120-
121358
train_manifests = [
122-
'path/to/manifest1.csv',
123-
'path/to/manifest2.csv',
359+
'path/to/manifest1.csv',
360+
'path/to/manifest2.csv',
124361
]
125-
126-
train_manifest = plain_merge_manifests(train_manifests,
362+
train_manifest = plain_merge_manifests(train_manifests,
127363
MIN_DURATION=0.1,
128364
MAX_DURATION=100)
129365
check_files(train_manifest)
130-
131366
save_manifest(train_manifest,
132367
'my_manifest.csv')
133368
```
134369

370+
</p>
371+
</details>
372+
135373
# **Contacts**
136374

137-
Please contact me [here](https://t.me/snakers41) or just create a GitHub issue!
375+
Please contact us [here](https://t.me/snakers41) or just create a GitHub issue!
138376

139377
# **FAQ**
140378

141379
## **1. Issues with reading files**
142380

143-
Maybe try this approach:
144-
```
381+
#### **Maybe try this approach:**
382+
<details><summary>See example</summary>
383+
<p>
384+
385+
```python
145386
from scipy.io import wavfile
146387

147388
sample_rate, sound = wavfile.read(path)
@@ -151,6 +392,10 @@ sound = sound.astype('float32')
151392
if abs_max>0:
152393
    sound *= 1/abs_max
153394
```
395+
396+
</p>
397+
</details>
398+
154399
## **2. Why share such a dataset?**
155400

156401
We are not altruists; life is just **not a zero-sum game**.
@@ -163,3 +408,11 @@ Consider the progress in computer vision, that was made possible by:
163408

164409
TTS does not enjoy the same attention from the ML community because it is data-hungry and public datasets are lacking, especially for languages other than English.
165410
Ultimately, this leaves the general community worse off.
411+
412+
## **3. Known issues with the dataset to be fixed**
413+
- Blank files in the YouTube dataset. Just filter them out using the meta data. Will be fixed in the future;
414+
- Some files have low values / crash with torchaudio;
415+
- It looks like scipy does not always write meta data when saving wavs (or you should save an (N,1)-shaped array). This can be worked around by reading files as shown above;
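
Until the blank and low-value files are fixed, they can also be detected directly from the audio. The sketch below is only an illustration: it assumes the files are 16-bit PCM wavs, the helper names `peak_amplitude` and `is_blank` are hypothetical, and the `1e-3` threshold is an arbitrary starting point rather than a value recommended by the dataset authors.

```python
import array
import sys
import wave


def peak_amplitude(path):
    """Return the peak absolute sample of a 16-bit PCM wav, scaled to [0, 1]."""
    with wave.open(str(path), 'rb') as wav:
        if wav.getsampwidth() != 2:
            raise ValueError('this sketch only handles 16-bit PCM')
        frames = wav.readframes(wav.getnframes())
    samples = array.array('h')
    samples.frombytes(frames)
    if sys.byteorder == 'big':
        # WAV data is little-endian; swap on big-endian hosts.
        samples.byteswap()
    if not samples:
        return 0.0
    return max(abs(sample) for sample in samples) / 32768.0


def is_blank(path, threshold=1e-3):
    """Flag files whose loudest sample stays below the (assumed) threshold."""
    return peak_amplitude(path) < threshold
```

Files flagged by `is_blank` can then be dropped from a manifest before training.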
416+
417+
# **License**
418+
Dual license: CC BY-NC, with commercial usage available after agreement with the dataset authors.
