
Commit 35fcd5a

Merge pull request #77 from 916BGAI/dev

update speech doc and examples

2 parents 08c453e + 2a0bb85 · commit 35fcd5a

9 files changed: +107 additions, −74 deletions


docs/doc/en/audio/digit.md

Lines changed: 17 additions & 11 deletions
@@ -13,15 +13,15 @@ update:
 
 ## Maix-Speech
 
-[`Maix-Speech`](https://github.com/sipeed/Maix-Speech) is an offline speech library specifically designed for embedded environments. It features deep optimization of speech recognition algorithms, achieving a significant lead in memory usage while maintaining excellent WER. For more details on the principles, please refer to the open-source project.
+[`Maix-Speech`](https://github.com/sipeed/Maix-Speech) is an offline speech recognition library specifically designed for embedded environments. It has been deeply optimized for speech recognition algorithms, significantly reducing memory usage while maintaining excellent recognition accuracy. For detailed information, please refer to the [Maix-Speech Documentation](https://github.com/sipeed/Maix-Speech/blob/master/usage_zh.md).
 
 ## Continuous Chinese digit recognition
 
 ```python
 from maix import app, nn
 
 speech = nn.Speech("/root/models/am_3332_192_int8.mud")
-speech.init(nn.SpeechDevice.DEVICE_MIC, "hw:0,0")
+speech.init(nn.SpeechDevice.DEVICE_MIC)
 
 def callback(data: str, len: int):
     print(data)
@@ -32,7 +32,6 @@ while not app.need_exit():
     frames = speech.run(1)
     if frames < 1:
         print("run out\n")
-        speech.deinit()
         break
 ```
 
@@ -55,10 +54,11 @@ speech = nn.Speech("/root/models/am_3332_192_int8.mud")
 3. Choose the corresponding audio device
 
 ```python
-speech.init(nn.SpeechDevice.DEVICE_MIC, "hw:0,0")
+speech.init(nn.SpeechDevice.DEVICE_MIC)
+speech.init(nn.SpeechDevice.DEVICE_MIC, "hw:0,0") # Specify the audio input device
 ```
 
-- This uses the onboard microphone and supports both `WAV` and `PCM` audio as input devices.
+- This uses the onboard microphone and supports both `WAV` and `PCM` audio as input.
 
 ```python
 speech.init(nn.SpeechDevice.DEVICE_WAV, "path/audio.wav") # Using WAV audio input
@@ -74,11 +74,10 @@ speech.init(nn.SpeechDevice.DEVICE_PCM, "path/audio.pcm") # Using PCM audio in
 arecord -d 5 -r 16000 -c 1 -f S16_LE audio.wav
 ```
 
-- When recognizing `PCM/WAV` , if you want to reset the data source, such as for the next WAV file recognition, you can use the `speech.devive` method, which will automatically clear the cache:
-
+- When recognizing `PCM/WAV` , if you want to reset the data source, such as for the next WAV file recognition, you can use the `speech.device` method, which will automatically clear the cache:
 
 ```python
-speech.devive(nn.SpeechDevice.DEVICE_WAV, "path/next.wav")
+speech.device(nn.SpeechDevice.DEVICE_WAV, "path/next.wav")
 ```
 
 4. Set up the decoder
@@ -89,11 +88,15 @@ def callback(data: str, len: int):
 
 speech.digit(640, callback)
 ```
-- Users can register several decoders (or none), which decode the results from the acoustic model and execute the corresponding user callback. Here, a `digit` decoder is registered to output the Chinese digit recognition results from the last 4 seconds. The returned recognition results are in string format and support `0123456789 .(dot) S(ten) B(hundred) Q(thousand) W(thousand)`. For other decoder usages, please refer to the sections on Real-time voice recognition and keyword recognition.
+- The user can configure multiple decoders simultaneously. `digit` decoder is registered to output the Chinese digit recognition results from the last 4 seconds. The returned recognition results are in string format and support `0123456789 .(dot) S(ten) B(hundred) Q(thousand) W(thousand)`.
 
 - When setting the `digit` decoder, you need to specify a `blank` value; exceeding this value (in ms) will insert a `_` in the output results to indicate idle silence.
 
-- After registering the decoder, use the `speech.deinit()` method to clear the initialization.
+- If a decoder is no longer needed, you can deinitialize it by calling the `speech.dec_deinit` method.
+
+```python
+speech.dec_deinit(nn.SpeechDecoder.DECODER_DIG)
+```
 
 5. Recognition
 
@@ -102,12 +105,15 @@ while not app.need_exit():
     frames = speech.run(1)
     if frames < 1:
         print("run out\n")
-        speech.deinit()
         break
 ```
 
 - Use the `speech.run` method to run speech recognition. The parameter specifies the number of frames to run each time, returning the actual number of frames processed. Users can choose to run 1 frame each time and then perform other processing, or run continuously in a single thread, stopping it with an external thread.
 
+- To clear the cache of recognized results, you can use the `speech.clear` method.
+
+- When switching decoders during recognition, the first frame after the switch may produce incorrect results. You can use `speech.skip_frames(1)` to skip the first frame and ensure the accuracy of subsequent results.
+
 ### Recognition Results
 
 If the above program runs successfully, speaking into the onboard microphone will yield continuous Chinese digit recognition results, such as:

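For reference, a minimal end-to-end sketch of the updated digit-recognition flow described in this diff, driven from WAV files and using the renamed `speech.device` method to move on to a second file. The WAV paths and the list of pending files are placeholder assumptions; the model path and API calls are the ones shown above.

```python
from maix import app, nn

# Sketch only: the WAV paths below are placeholders, not files shipped with the docs.
speech = nn.Speech("/root/models/am_3332_192_int8.mud")
speech.init(nn.SpeechDevice.DEVICE_WAV, "path/audio.wav")   # first WAV file as the data source

def on_digits(data: str, length: int):
    # data is a string built from 0-9 . S B Q W, with "_" marking idle silence
    print("digits:", data)

speech.digit(640, on_digits)        # insert "_" after 640 ms of silence

pending = ["path/next.wav"]         # further files to recognize once the first runs out
while not app.need_exit():
    if speech.run(1) < 1:           # the current file has been fully consumed
        if not pending:
            break
        # switch the data source; the internal cache is cleared automatically
        speech.device(nn.SpeechDevice.DEVICE_WAV, pending.pop(0))

speech.dec_deinit(nn.SpeechDecoder.DECODER_DIG)   # release the digit decoder when finished
```
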
docs/doc/en/audio/keyword.md

Lines changed: 17 additions & 11 deletions
@@ -13,15 +13,15 @@ update:
 
 ## Maix-Speech
 
-[`Maix-Speech`](https://github.com/sipeed/Maix-Speech) is an offline speech library specifically designed for embedded environments. It features deep optimization of speech recognition algorithms, achieving a significant lead in memory usage while maintaining excellent WER. For more details on the principles, please refer to the open-source project.
+[`Maix-Speech`](https://github.com/sipeed/Maix-Speech) is an offline speech recognition library specifically designed for embedded environments. It has been deeply optimized for speech recognition algorithms, significantly reducing memory usage while maintaining excellent recognition accuracy. For detailed information, please refer to the [Maix-Speech Documentation](https://github.com/sipeed/Maix-Speech/blob/master/usage_zh.md).
 
 ## Keyword recognition
 
 ```python
 from maix import app, nn
 
 speech = nn.Speech("/root/models/am_3332_192_int8.mud")
-speech.init(nn.SpeechDevice.DEVICE_MIC, "hw:0,0")
+speech.init(nn.SpeechDevice.DEVICE_MIC)
 
 kw_tbl = ['xiao3 ai4 tong2 xue2',
           'ni3 hao3',
@@ -39,7 +39,6 @@ while not app.need_exit():
     frames = speech.run(1)
     if frames < 1:
         print("run out\n")
-        speech.deinit()
         break
 ```
 
@@ -62,10 +61,11 @@ speech = nn.Speech("/root/models/am_3332_192_int8.mud")
 3. Choose the corresponding audio device
 
 ```python
-speech.init(nn.SpeechDevice.DEVICE_MIC, "hw:0,0")
+speech.init(nn.SpeechDevice.DEVICE_MIC)
+speech.init(nn.SpeechDevice.DEVICE_MIC, "hw:0,0") # Specify the audio input device
 ```
 
-- This uses the onboard microphone and supports both `WAV` and `PCM` audio as input devices.
+- This uses the onboard microphone and supports both `WAV` and `PCM` audio as input.
 
 ```python
 speech.init(nn.SpeechDevice.DEVICE_WAV, "path/audio.wav") # Using WAV audio input
@@ -81,11 +81,10 @@ speech.init(nn.SpeechDevice.DEVICE_PCM, "path/audio.pcm") # Using PCM audio in
 arecord -d 5 -r 16000 -c 1 -f S16_LE audio.wav
 ```
 
-- When recognizing `PCM/WAV` , if you want to reset the data source, such as for the next WAV file recognition, you can use the `speech.devive` method, which will automatically clear the cache:
-
+- When recognizing `PCM/WAV` , if you want to reset the data source, such as for the next WAV file recognition, you can use the `speech.device` method, which will automatically clear the cache:
 
 ```python
-speech.devive(nn.SpeechDevice.DEVICE_WAV, "path/next.wav")
+speech.device(nn.SpeechDevice.DEVICE_WAV, "path/next.wav")
 ```
 
 4. Set up the decoder
@@ -103,7 +102,7 @@ def callback(data:list[float], len: int):
 
 speech.kws(kw_tbl, kw_gate, callback, True)
 ```
-- Users can register several decoders (or none), which decode the results from the acoustic model and execute the corresponding user callback. Here, a `kws` decoder is registered to output a list of probabilities for all registered keywords from the last frame. Users can observe the probability values and set their own thresholds for activation. For other decoder usages, please refer to the sections on Real-time voice recognition and continuous Chinese numeral recognition.
+- The user can configure multiple decoders simultaneously. `kws` decoder is registered to output a list of probabilities for all registered keywords from the last frame. Users can observe the probability values and set their own thresholds for activation.
 
 - When setting up the `kws` decoder, you need to provide a `keyword list` separated by spaces in Pinyin, a `keyword probability threshold list` arranged in order, and specify whether to enable `automatic near-sound processing`. If set to `True`, different tones of the same Pinyin will be treated as similar words to accumulate probabilities. Finally, you need to set a callback function to handle the decoded data.
 
@@ -114,7 +113,11 @@ similar_char = ['zhen3', 'zheng3']
 speech.similar('zen3', similar_char)
 ```
 
-- After registering the decoder, use the `speech.deinit()` method to clear the initialization.
+- If a decoder is no longer needed, you can deinitialize it by calling the `speech.dec_deinit` method.
+
+```python
+speech.dec_deinit(nn.SpeechDecoder.DECODER_KWS)
+```
 
 5. Recognition
 
@@ -123,12 +126,15 @@ while not app.need_exit():
     frames = speech.run(1)
     if frames < 1:
         print("run out\n")
-        speech.deinit()
         break
 ```
 
 - Use the `speech.run` method to run speech recognition. The parameter specifies the number of frames to run each time, returning the actual number of frames processed. Users can choose to run 1 frame each time and then perform other processing, or run continuously in a single thread, stopping it with an external thread.
 
+- To clear the cache of recognized results, you can use the `speech.clear` method.
+
+- When switching decoders during recognition, the first frame after the switch may produce incorrect results. You can use `speech.skip_frames(1)` to skip the first frame and ensure the accuracy of subsequent results.
+
 ### Recognition Results
 
 If the above program runs successfully, speaking into the onboard microphone will yield keyword recognition results, such as:

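As a usage sketch of the keyword-spotting flow above: the keyword table, gate values, and the 0.5 activation threshold below are illustrative assumptions, while the `init`, `kws`, `similar`, and `run` calls follow the API shown in this diff.

```python
from maix import app, nn

speech = nn.Speech("/root/models/am_3332_192_int8.mud")
speech.init(nn.SpeechDevice.DEVICE_MIC)

kw_tbl  = ['xiao3 ai4 tong2 xue2', 'ni3 hao3']   # keywords as space-separated Pinyin
kw_gate = [0.1, 0.1]                             # per-keyword probability gates (assumed values)

def on_kws(data: list[float], length: int):
    # data holds one probability per registered keyword for the latest frame
    for i, p in enumerate(data):
        if p > 0.5:                              # user-chosen activation threshold
            print("keyword activated:", kw_tbl[i], p)

speech.kws(kw_tbl, kw_gate, on_kws, True)        # True enables automatic near-sound handling

# accumulate the probabilities of near-sounds of "zen3" as well
speech.similar('zen3', ['zhen3', 'zheng3'])

while not app.need_exit():
    if speech.run(1) < 1:
        print("run out")
        break
```
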
docs/doc/en/audio/recognize.md

Lines changed: 17 additions & 11 deletions
@@ -13,15 +13,15 @@ update:
 
 ## Maix-Speech
 
-[`Maix-Speech`](https://github.com/sipeed/Maix-Speech) is an offline speech library specifically designed for embedded environments. It features deep optimization of speech recognition algorithms, achieving a significant lead in memory usage while maintaining excellent WER. For more details on the principles, please refer to the open-source project.
+[`Maix-Speech`](https://github.com/sipeed/Maix-Speech) is an offline speech recognition library specifically designed for embedded environments. It has been deeply optimized for speech recognition algorithms, significantly reducing memory usage while maintaining excellent recognition accuracy. For detailed information, please refer to the [Maix-Speech Documentation](https://github.com/sipeed/Maix-Speech/blob/master/usage_zh.md).
 
 ## Continuous Large Vocabulary Speech Recognition
 
 ```python
 from maix import app, nn
 
 speech = nn.Speech("/root/models/am_3332_192_int8.mud")
-speech.init(nn.SpeechDevice.DEVICE_MIC, "hw:0,0")
+speech.init(nn.SpeechDevice.DEVICE_MIC)
 
 def callback(data: tuple[str, str], len: int):
     print(data)
@@ -36,7 +36,6 @@ while not app.need_exit():
     frames = speech.run(1)
     if frames < 1:
         print("run out\n")
-        speech.deinit()
         break
 ```
 
@@ -59,10 +58,11 @@ speech = nn.Speech("/root/models/am_3332_192_int8.mud")
 3. Choose the corresponding audio device
 
 ```python
-speech.init(nn.SpeechDevice.DEVICE_MIC, "hw:0,0")
+speech.init(nn.SpeechDevice.DEVICE_MIC)
+speech.init(nn.SpeechDevice.DEVICE_MIC, "hw:0,0") # Specify the audio input device
 ```
 
-- This uses the onboard microphone and supports both `WAV` and `PCM` audio as input devices.
+- This uses the onboard microphone and supports both `WAV` and `PCM` audio as input.
 
 ```python
 speech.init(nn.SpeechDevice.DEVICE_WAV, "path/audio.wav") # Using WAV audio input
@@ -78,11 +78,10 @@ speech.init(nn.SpeechDevice.DEVICE_PCM, "path/audio.pcm") # Using PCM audio in
 arecord -d 5 -r 16000 -c 1 -f S16_LE audio.wav
 ```
 
-- When recognizing `PCM/WAV` , if you want to reset the data source, such as for the next WAV file recognition, you can use the `speech.devive` method, which will automatically clear the cache:
-
+- When recognizing `PCM/WAV` , if you want to reset the data source, such as for the next WAV file recognition, you can use the `speech.device` method, which will automatically clear the cache:
 
 ```python
-speech.devive(nn.SpeechDevice.DEVICE_WAV, "path/next.wav")
+speech.device(nn.SpeechDevice.DEVICE_WAV, "path/next.wav")
 ```
 
 4. Set up the decoder
 
@@ -97,11 +96,15 @@ speech.lvcsr(lmS_path + "lg_6m.sfst", lmS_path + "lg_6m.sym", \
              lmS_path + "phones.bin", lmS_path + "words_utf.bin", \
              callback)
 ```
-- Users can register several decoders (or none), which decode the results from the acoustic model and execute the corresponding user callback. Here, a `lvcsr` decoder is registered to output continuous speech recognition results (for fewer than 1024 Chinese characters). For other decoder usages, please refer to the sections on continuous Chinese numeral recognition and keyword recognition.
+- The user can configure multiple decoders simultaneously. `lvcsr` decoder is registered to output continuous speech recognition results (for fewer than 1024 Chinese characters).
 
 - When setting up the `lvcsr` decoder, you need to specify the paths for the `sfst` file, the `sym` file (output symbol table), the path for `phones.bin` (phonetic table), and the path for `words.bin` (dictionary). Lastly, a callback function must be set to handle the decoded data.
 
-- After registering the decoder, use the `speech.deinit()` method to clear the initialization.
+- If a decoder is no longer needed, you can deinitialize it by calling the `speech.dec_deinit` method.
+
+```python
+speech.dec_deinit(nn.SpeechDecoder.DECODER_LVCSR)
+```
 
 5. Recognition
 
@@ -110,12 +113,15 @@ while not app.need_exit():
     frames = speech.run(1)
     if frames < 1:
         print("run out\n")
-        speech.deinit()
         break
 ```
 
 - Use the `speech.run` method to run speech recognition. The parameter specifies the number of frames to run each time, returning the actual number of frames processed. Users can choose to run 1 frame each time and then perform other processing, or run continuously in a single thread, stopping it with an external thread.
 
+- To clear the cache of recognized results, you can use the `speech.clear` method.
+
+- When switching decoders during recognition, the first frame after the switch may produce incorrect results. You can use `speech.skip_frames(1)` to skip the first frame and ensure the accuracy of subsequent results.
+
 ### Recognition Results
 
 If the above program runs successfully, speaking into the onboard microphone will yield real-time speech recognition results, such as:

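A minimal sketch of the large-vocabulary flow above, put together from the calls shown in this diff. `lmS_path` is an assumed location for the language-model files, and the layout of the two strings passed to the callback is not spelled out here, so the callback simply prints the pair.

```python
from maix import app, nn

speech = nn.Speech("/root/models/am_3332_192_int8.mud")
speech.init(nn.SpeechDevice.DEVICE_MIC)

def on_text(data: tuple[str, str], length: int):
    # data is a pair of strings carrying the recognition result (per the tuple[str, str] signature above)
    print(data)

lmS_path = "/root/models/lmS/"   # assumed directory holding the lvcsr language-model files
speech.lvcsr(lmS_path + "lg_6m.sfst", lmS_path + "lg_6m.sym",
             lmS_path + "phones.bin", lmS_path + "words_utf.bin",
             on_text)

while not app.need_exit():
    if speech.run(1) < 1:
        print("run out")
        break

speech.dec_deinit(nn.SpeechDecoder.DECODER_LVCSR)   # release the decoder when finished
```
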
docs/doc/zh/audio/digit.md

Lines changed: 17 additions & 11 deletions
@@ -13,15 +13,15 @@ update:
 
 ## Maix-Speech
 
-[`Maix-Speech`](https://github.com/sipeed/Maix-Speech) is an offline speech library designed for embedded environments. It is deeply optimized for speech recognition algorithms, leading by an order of magnitude in memory usage while maintaining a good WER. To learn about the principles, see the open-source project.
+[`Maix-Speech`](https://github.com/sipeed/Maix-Speech) is an offline speech recognition library designed for embedded environments. It is deeply optimized for speech recognition algorithms, significantly reducing memory usage while delivering excellent recognition accuracy. For details, see the [Maix-Speech documentation](https://github.com/sipeed/Maix-Speech/blob/master/usage_zh.md).
 
 ## Continuous Chinese digit recognition
 
 ```python
 from maix import app, nn
 
 speech = nn.Speech("/root/models/am_3332_192_int8.mud")
-speech.init(nn.SpeechDevice.DEVICE_MIC, "hw:0,0")
+speech.init(nn.SpeechDevice.DEVICE_MIC)
 
 def callback(data: str, len: int):
     print(data)
@@ -32,7 +32,6 @@ while not app.need_exit():
     frames = speech.run(1)
     if frames < 1:
         print("run out\n")
-        speech.deinit()
         break
 ```
 
@@ -55,10 +54,11 @@ speech = nn.Speech("/root/models/am_3332_192_int8.mud")
 3. Choose the corresponding audio device
 
 ```python
-speech.init(nn.SpeechDevice.DEVICE_MIC, "hw:0,0")
+speech.init(nn.SpeechDevice.DEVICE_MIC)
+speech.init(nn.SpeechDevice.DEVICE_MIC, "hw:0,0") # Specify the audio input device
 ```
 
-- This uses the onboard microphone; `WAV` or `PCM` audio can also be chosen as the input device.
+- This uses the onboard microphone; `WAV` or `PCM` audio can also be chosen as input.
 
 ```python
 speech.init(nn.SpeechDevice.DEVICE_WAV, "path/audio.wav") # Using WAV audio input
@@ -74,11 +74,10 @@ speech.init(nn.SpeechDevice.DEVICE_PCM, "path/audio.pcm") # Using PCM audio input
 arecord -d 5 -r 16000 -c 1 -f S16_LE audio.wav
 ```
 
-- When recognizing `PCM/WAV`, if you want to reset the data source, for example to recognize the next WAV file, you can use the `speech.devive` method, which clears the cache automatically:
-
+- When recognizing `PCM/WAV`, if you want to reset the data source, for example to recognize the next WAV file, you can use the `speech.device` method, which clears the cache automatically:
 
 ```python
-speech.devive(nn.SpeechDevice.DEVICE_WAV, "path/next.wav")
+speech.device(nn.SpeechDevice.DEVICE_WAV, "path/next.wav")
 ```
 
 4. Set up the decoder
 
@@ -89,11 +88,15 @@ def callback(data: str, len: int):
 
 speech.digit(640, callback)
 ```
-- Users can register several decoders (or none). A decoder decodes the acoustic model results and executes the corresponding user callback. Here a `digit` decoder is registered to output the Chinese digit recognition results from the last 4 seconds. The returned results are strings supporting `0123456789 .(dot) S(ten) B(hundred) Q(thousand) W(ten thousand)`. For other decoders, see the real-time speech recognition and keyword recognition sections.
+- Users can set up multiple decoders at the same time. The `digit` decoder outputs the Chinese digit recognition results from the last 4 seconds. The returned results are strings supporting `0123456789 .(dot) S(ten) B(hundred) Q(thousand) W(ten thousand)`.
 
 - When setting the `digit` decoder, a `blank` value must be given; if it is exceeded (in ms), a `_` is inserted into the output to indicate idle silence.
 
-- After registering the decoder, use the `speech.deinit()` method to clear the initialization.
+- If a decoder is no longer needed, it can be deinitialized by calling the `speech.dec_deinit` method.
+
+```python
+speech.dec_deinit(nn.SpeechDecoder.DECODER_DIG)
+```
 
 5. Recognition
 
@@ -102,12 +105,15 @@ while not app.need_exit():
     frames = speech.run(1)
     if frames < 1:
         print("run out\n")
-        speech.deinit()
         break
 ```
 
 - Use the `speech.run` method to run speech recognition. The parameter is the number of frames to run each time; it returns the number of frames actually run. Users can run 1 frame at a time and then do other processing, or run continuously in one thread and stop it from another thread.
 
+- To clear the cache of recognized results, use the `speech.clear` method.
+
+- When switching decoders during recognition, the first frame after the switch may be recognized incorrectly. Use `speech.skip_frames(1)` to skip the first frame and ensure subsequent results are accurate.
+
 ### Recognition Results
 
 If the above program runs correctly, speaking into the onboard microphone will produce continuous Chinese digit recognition results, such as:

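The notes on `speech.clear` and `speech.skip_frames` above suggest a decoder-switching sketch like the one below. The point at which the switch happens and the keyword/gate values are arbitrary illustrative choices, and calling `speech.clear()` with no arguments is an assumption based on the description in this diff.

```python
from maix import app, nn

speech = nn.Speech("/root/models/am_3332_192_int8.mud")
speech.init(nn.SpeechDevice.DEVICE_MIC)

def on_digits(data: str, length: int):
    print("digits:", data)

def on_kws(data: list[float], length: int):
    print("keyword probabilities:", data)

speech.digit(640, on_digits)            # start out with the digit decoder
frames_done = 0

while not app.need_exit():
    frames = speech.run(1)
    if frames < 1:
        print("run out")
        break
    frames_done += frames
    if frames_done == 200:              # arbitrary point at which to switch decoders
        speech.dec_deinit(nn.SpeechDecoder.DECODER_DIG)
        speech.clear()                  # drop cached results from the previous decoder
        speech.kws(['ni3 hao3'], [0.1], on_kws, True)
        speech.skip_frames(1)           # the first frame after a switch may be wrong
```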