
Commit 499289f

feat(examples): use mp3 output by default
1 parent db13e42 commit 499289f

10 files changed: +63 −15 lines

.gitignore

Lines changed: 1 addition & 0 deletions

```diff
@@ -169,3 +169,4 @@ cython_debug/
 
 # inferred result
 *.wav
+*.mp3
```

README.md

Lines changed: 8 additions & 5 deletions

````diff
@@ -11,7 +11,7 @@ A generative speech model for daily dialogue.
 [![Open In Colab](https://img.shields.io/badge/Colab-F9AB00?style=for-the-badge&logo=googlecolab&color=525252)](https://colab.research.google.com/github/2noise/ChatTTS/blob/main/examples/ipynb/colab.ipynb)
 [![Discord](https://img.shields.io/badge/ChatTTS-Discord-7289DA?style=for-the-badge&logo=discord&logoColor=white)](https://discord.gg/Ud5Jxgx5yD)
 
-**English** | [**简体中文**](docs/cn/README.md) | [**日本語**](docs/jp/README.md) | [**Русский**](docs/ru/README.md)
+**English** | [**简体中文**](docs/cn/README.md) | [**日本語**](docs/jp/README.md) | [**Русский**](docs/ru/README.md) | [**Español**](docs/es/README.md)
 
 </div>
 
@@ -93,29 +93,31 @@ pip install -r requirements.txt
 ```
 
 ### Quick Start
+> Make sure you are in the project root directory when you execute the commands below.
+
 #### 1. Launch WebUI
 ```bash
 python examples/web/webui.py
 ```
 
 #### 2. Infer by Command Line
-> It will save audio to `./output_audio_xxx.wav`
+> It will save audio to `./output_audio_n.mp3`
 
 ```bash
-python examples/cmd/run.py "Please input your text."
+python examples/cmd/run.py "Your text 1." "Your text 2."
 ```
 
 ### Basic
 
 ```python
 import ChatTTS
-from IPython.display import Audio
+import torch
 import torchaudio
 
 chat = ChatTTS.Chat()
 chat.load(compile=False) # Set to True for better performance
 
-texts = ["PUT YOUR TEXT HERE",]
+texts = ["PUT YOUR 1st TEXT HERE", "PUT YOUR 2nd TEXT HERE"]
 
 wavs = chat.infer(texts)
 
@@ -154,6 +156,7 @@ wavs = chat.infer(
 
 ###################################
 # For word level manual control.
+
 text = 'What is [uv_break]your favorite english food?[laugh][lbreak]'
 wavs = chat.infer(text, skip_refine_text=True, params_refine_text=params_refine_text, params_infer_code=params_infer_code)
 torchaudio.save("output2.wav", torch.from_numpy(wavs[0]), 24000)
````

docs/cn/README.md

Lines changed: 1 addition & 1 deletion

```diff
@@ -10,7 +10,7 @@
 [![Huggingface](https://img.shields.io/badge/🤗%20-Models-yellow.svg?style=for-the-badge)](https://huggingface.co/2Noise/ChatTTS)
 [![Open In Colab](https://img.shields.io/badge/Colab-F9AB00?style=for-the-badge&logo=googlecolab&color=525252)](https://colab.research.google.com/github/2noise/ChatTTS/blob/main/examples/ipynb/colab.ipynb)
 
-[**English**](../../README.md) | **简体中文** | [**日本語**](../jp/README.md) | [**Русский**](../ru/README.md)
+[**English**](../../README.md) | **简体中文** | [**日本語**](../jp/README.md) | [**Русский**](../ru/README.md) | [**Español**](../es/README.md)
 
 </div>
 
```

docs/jp/README.md

Lines changed: 1 addition & 1 deletion

```diff
@@ -4,7 +4,7 @@
 
 [![Huggingface](https://img.shields.io/badge/🤗%20-Models-yellow.svg?style=for-the-badge)](https://huggingface.co/2Noise/ChatTTS)
 
-[**English**](../../README.md) | [**简体中文**](../cn/README.md) | **日本語** | [**Русский**](../ru/README.md)
+[**English**](../../README.md) | [**简体中文**](../cn/README.md) | **日本語** | [**Русский**](../ru/README.md) | [**Español**](../es/README.md)
 
 ChatTTSは、LLMアシスタントなどの対話シナリオ用に特別に設計されたテキストから音声へのモデルです。英語と中国語の両方をサポートしています。私たちのモデルは、中国語と英語で構成される100,000時間以上でトレーニングされています。**[HuggingFace](https://huggingface.co/2Noise/ChatTTS)**でオープンソース化されているバージョンは、40,000時間の事前トレーニングモデルで、SFTは行われていません。
 
```

docs/ru/README.md

Lines changed: 1 addition & 1 deletion

```diff
@@ -4,7 +4,7 @@
 
 [![Huggingface](https://img.shields.io/badge/🤗%20-Models-yellow.svg?style=for-the-badge)](https://huggingface.co/2Noise/ChatTTS)
 
-[**English**](../../README.md) | [**简体中文**](../cn/README.md) | [**日本語**](../jp/README.md) | **Русский**
+[**English**](../../README.md) | [**简体中文**](../cn/README.md) | [**日本語**](../jp/README.md) | **Русский** | [**Español**](../es/README.md)
 
 ChatTTS - это модель преобразования текста в речь, специально разработанная для диалоговых сценариев, таких как помощник LLM. Она поддерживает как английский, так и китайский языки. Наша модель обучена на более чем 100 000 часах английского и китайского языков. Открытая версия на **[HuggingFace](https://huggingface.co/2Noise/ChatTTS)** - это предварительно обученная модель с 40 000 часами без SFT.
 
```

examples/cmd/run.py

Lines changed: 11 additions & 6 deletions

```diff
@@ -8,23 +8,28 @@
 
 import wave
 import argparse
+from io import BytesIO
 
 import ChatTTS
 
-from tools.audio import unsafe_float_to_int16
+from tools.audio import unsafe_float_to_int16, wav2
 from tools.logger import get_logger
 
 logger = get_logger("Command")
 
 
-def save_wav_file(wav, index):
-    wav_filename = f"output_audio_{index}.wav"
-    with wave.open(wav_filename, "wb") as wf:
+def save_mp3_file(wav, index):
+    buf = BytesIO()
+    with wave.open(buf, "wb") as wf:
         wf.setnchannels(1)  # Mono channel
         wf.setsampwidth(2)  # Sample width in bytes
         wf.setframerate(24000)  # Sample rate in Hz
         wf.writeframes(unsafe_float_to_int16(wav))
-    logger.info(f"Audio saved to {wav_filename}")
+    buf.seek(0, 0)
+    mp3_filename = f"output_audio_{index}.mp3"
+    with open(mp3_filename, "wb") as f:
+        wav2(buf, f, "mp3")
+    logger.info(f"Audio saved to {mp3_filename}")
 
 
 def main(texts: list[str]):
@@ -42,7 +47,7 @@ def main(texts: list[str]):
     logger.info("Inference completed. Audio generation successful.")
     # Save each generated wav file to a local file
     for index, wav in enumerate(wavs):
-        save_wav_file(wav, index)
+        save_mp3_file(wav, index)
 
 
 if __name__ == "__main__":
```
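The new `save_mp3_file` first renders PCM into an in-memory WAV and only then converts it. Below is a minimal stdlib-only sketch of that buffering step; `float_to_int16` and `wav_buffer` are names introduced here for illustration, with `float_to_int16` standing in for `tools.audio.unsafe_float_to_int16`:

```python
import wave
from io import BytesIO

def float_to_int16(samples):
    # Hypothetical stand-in for tools.audio.unsafe_float_to_int16:
    # clamp floats to [-1, 1] and scale to signed 16-bit little-endian PCM.
    out = bytearray()
    for s in samples:
        s = max(-1.0, min(1.0, s))
        out += int(s * 32767).to_bytes(2, "little", signed=True)
    return bytes(out)

def wav_buffer(samples, rate=24000):
    # Mirror save_mp3_file's first half: write mono 16-bit PCM
    # into an in-memory WAV instead of a file on disk.
    buf = BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(1)     # Mono channel
        wf.setsampwidth(2)     # Sample width in bytes
        wf.setframerate(rate)  # Sample rate in Hz
        wf.writeframes(float_to_int16(samples))
    buf.seek(0, 0)  # rewind so a converter (e.g. wav2) reads from the start
    return buf
```

From here, `save_mp3_file` hands the rewound buffer to `wav2` to encode the mp3; any WAV-consuming converter could read it the same way.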

examples/web/webui.py

Lines changed: 2 additions & 1 deletion

```diff
@@ -78,7 +78,7 @@ def main():
             "Interrupt", scale=2, variant="stop", visible=False, interactive=False
         )
 
-        text_output = gr.Textbox(label="Output Text", interactive=False)
+        text_output = gr.Textbox(label="Output Text", interactive=False, show_copy_button=True)
 
         # 使用Gradio的回调功能来更新数值输入框
         voice_selection.change(
@@ -117,6 +117,7 @@ def make_audio(autoplay, stream):
         streaming=stream,
         interactive=False,
         show_label=True,
+        format="mp3",
     )
     text_output.change(
         text_output_listener,
```

requirements.txt

Lines changed: 1 addition & 0 deletions

```diff
@@ -12,3 +12,4 @@ pybase16384
 pynini==2.1.5; sys_platform == 'linux'
 WeTextProcessing; sys_platform == 'linux'
 nemo_text_processing; sys_platform == 'linux'
+av
```

tools/audio/__init__.py

Lines changed: 1 addition & 0 deletions

```diff
@@ -1 +1,2 @@
 from .np import unsafe_float_to_int16
+from .av import wav2
```

tools/audio/av.py

Lines changed: 36 additions & 0 deletions

```diff
@@ -0,0 +1,36 @@
+from io import BufferedWriter, BytesIO
+from typing import Dict
+
+import av
+
+
+video_format_dict: Dict[str, str] = {
+    "m4a": "mp4",
+}
+
+audio_format_dict: Dict[str, str] = {
+    "ogg": "libvorbis",
+    "mp4": "aac",
+}
+
+
+def wav2(i: BytesIO, o: BufferedWriter, format: str):
+    """
+    https://github.com/fumiama/Retrieval-based-Voice-Conversion-WebUI/blob/412a9950a1e371a018c381d1bfb8579c4b0de329/infer/lib/audio.py#L20
+    """
+    inp = av.open(i, "r")
+    format = video_format_dict.get(format, format)
+    out = av.open(o, "w", format=format)
+    format = audio_format_dict.get(format, format)
+
+    ostream = out.add_stream(format)
+
+    for frame in inp.decode(audio=0):
+        for p in ostream.encode(frame):
+            out.mux(p)
+
+    for p in ostream.encode(None):
+        out.mux(p)
+
+    out.close()
+    inp.close()
```
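`wav2` resolves the target format in two dictionary lookups: the requested extension maps to a container format for `av.open`, and the container then maps to an audio codec for `add_stream`. A sketch of that resolution logic alone (`resolve` is a name introduced here for illustration; no PyAV required):

```python
from typing import Dict, Tuple

# The same mappings wav2 consults in tools/audio/av.py.
video_format_dict: Dict[str, str] = {"m4a": "mp4"}
audio_format_dict: Dict[str, str] = {"ogg": "libvorbis", "mp4": "aac"}

def resolve(format: str) -> Tuple[str, str]:
    # Step 1: extension -> container format (e.g. m4a files use an mp4 container).
    container = video_format_dict.get(format, format)
    # Step 2: container -> audio codec (e.g. an mp4 container carries aac audio).
    codec = audio_format_dict.get(container, container)
    return container, codec
```

"mp3" falls through both lookups unchanged, so the commit's new default encodes an mp3 container with the mp3 codec; "m4a" resolves to ("mp4", "aac") and "ogg" to ("ogg", "libvorbis").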
