
Commit e699e3e

Added DNSMOS in evaluation, updated Chinese README

1 parent 7994a0f commit e699e3e

File tree: 7 files changed, +225 −23 lines changed

README-CN.md

Lines changed: 47 additions & 11 deletions
````diff
@@ -9,6 +9,41 @@
 
 We will keep improving the model quality and adding more features.
 
+## Evaluation📊
+
+We performed a series of objective evaluations on Seed-VC's voice conversion capabilities.
+For ease of reproduction, source audios are 100 random utterances from LibriTTS-test-clean, and reference audios are 12 randomly picked in-the-wild voices with unique characteristics. <br>
+
+Source audios can be found under `./examples/libritts-test-clean` <br>
+Reference audios can be found under `./examples/reference` <br>
+
+We evaluated the conversion results in terms of speaker embedding cosine similarity (SECS), word error rate (WER) and character error rate (CER), and compared our results with two strong open-sourced baselines, namely [OpenVoice](https://github.com/myshell-ai/OpenVoice) and [CosyVoice](https://github.com/FunAudioLLM/CosyVoice).
+Results in the table below show that our Seed-VC model significantly outperforms the baseline models in both intelligibility and speaker similarity.<br>
+
+| Models\Metrics | SECS↑      | WER↓       | CER↓       | SIG↑     | BAK↑     | OVRL↑    |
+|----------------|------------|------------|------------|----------|----------|----------|
+| Ground Truth   | 1.0000     | 0.0802     | 0.0157     | ~        | ~        | ~        |
+| OpenVoice      | 0.7547     | 0.1546     | 0.0473     | **3.56** | **4.02** | **3.27** |
+| CosyVoice      | 0.8440     | 0.1898     | 0.0729     | 3.51     | **4.02** | 3.21     |
+| Seed-VC(Ours)  | **0.8676** | **0.1199** | **0.0292** | 3.42     | 3.97     | 3.11     |
+
+*ASR results computed by the [facebook/hubert-large-ls960-ft](https://huggingface.co/facebook/hubert-large-ls960-ft) model*
+*Speaker embeddings computed by the [resemblyzer](https://github.com/resemble-ai/Resemblyzer) model* <br>
+
+You can reproduce the evaluation by running the `eval.py` script.
+```bash
+python eval.py
+--source ./examples/libritts-test-clean
+--target ./examples/reference
+--output ./examples/eval/converted
+--diffusion-steps 25
+--length-adjust 1.0
+--inference-cfg-rate 0.7
+--xvector-extractor "resemblyzer"
+--baseline ""  # fill in openvoice or cosyvoice to compute baseline results
+--max-samples 100  # maximum number of source utterances to process
+```
+Before that, make sure you have the openvoice and cosyvoice repos correctly installed under `../OpenVoice/` and `../CosyVoice/` if you want to run the baseline evaluation.
 ## Installation 📥
 Python 3.10 is recommended on Windows or Linux:
 ```bash
@@ -20,15 +55,14 @@ pip install -r requirements.txt
 
 Command line inference:
 ```bash
-python inference.py --source <source-audio-path> \
---target <reference-audio-path> \
---output <output-dir> \
---diffusion-steps 25 \ # recommended 50~100 for singing voice conversion
---length-adjust 1.0 \
---inference-cfg-rate 0.7 \
---n-quantizers 3 \
---f0-condition False \ # set to True for singing voice conversion
---auto-f0-condition False \ # set to True to automatically adjust source pitch to the target pitch level, normally not used in singing voice conversion
+python inference.py --source <source-audio-path>
+--target <reference-audio-path>
+--output <output-dir>
+--diffusion-steps 25  # recommended 50~100 for singing voice conversion
+--length-adjust 1.0
+--inference-cfg-rate 0.7
+--f0-condition False  # set to True for singing voice conversion
+--auto-f0-condition False  # set to True to automatically adjust source pitch to the target pitch level, normally not used in singing voice conversion
 --semi-tone-shift 0  # semitone shift for singing voice conversion
 ```
 Where:
@@ -38,7 +72,6 @@ python inference.py --source <source-audio-path> \
 - `diffusion-steps` number of diffusion steps to use, default 25; use 50-100 for best quality, 4-10 for fastest inference
 - `length-adjust` length adjustment factor, default 1.0; <1.0 speeds up speech, >1.0 slows it down
 - `inference-cfg-rate` has a subtle influence on the output, default 0.7
-- `n-quantizers` number of FAcodec codebooks to use, default 3; the fewer codebooks used, the less prosody of the source audio is preserved
 - `f0-condition` whether to condition the output pitch on the pitch of the source audio, default False; set to True for singing voice conversion
 - `auto-f0-condition` whether to automatically adjust the source pitch to the target pitch level, default False; normally not used in singing voice conversion
 - `semi-tone-shift` semitone shift for singing voice conversion, default 0
@@ -59,13 +92,16 @@ python app.py
   - [x] This is enabled in the f0-conditioned model, but not sure whether it works well...
 - [ ] Potential architecture improvements
   - [x] U-ViT-style skip connections
-  - [x] Changed input to [FAcodec](https://github.com/Plachtaa/FAcodec) tokens
+  - [x] Changed input to OpenAI Whisper
 - [ ] Code for training on custom data
 - [x] Changed singing voice decoder to NVIDIA's BigVGAN
 - [ ] 44kHz singing voice conversion model
 - [ ] More to be added
 
 ## Changelog 🗒️
+- 2024-09-26:
+  - Added objective metric evaluation results
+  - Changed speech content encoder to OpenAI Whisper
 - 2024-09-22:
   - Changed the decoder of the singing voice conversion model to BigVGAN, fixing most cases where high-pitched singing failed to convert correctly
   - Support chunked processing and streaming output for long input audio in the Web UI
````
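The SECS numbers in the table above are cosine similarities between resemblyzer speaker embeddings of the converted and reference audio. Below is a minimal sketch of that computation, assuming resemblyzer's public API; the file paths are hypothetical placeholders and `eval.py` may differ in details:

```python
# Hedged sketch: SECS as cosine similarity of resemblyzer speaker embeddings.
# File paths below are hypothetical placeholders.
import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav

encoder = VoiceEncoder()  # pretrained speaker encoder
emb_converted = encoder.embed_utterance(preprocess_wav("converted.wav"))
emb_reference = encoder.embed_utterance(preprocess_wav("reference.wav"))

# embed_utterance returns L2-normalized embeddings, so a dot product is the cosine similarity.
secs = float(np.dot(emb_converted, emb_reference))
print(f"SECS: {secs:.4f}")
```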

README.md

Lines changed: 12 additions & 10 deletions
````diff
@@ -10,7 +10,7 @@ We are keeping on improving the model quality and adding more features.
 
 ## Evaluation📊
 We have performed a series of objective evaluations on our Seed-VC's voice conversion capabilities.
-For ease for reproduction, source audios are 100 random utterances from LibriTTS-test-clean, and reference audios are 12 randomly picked in-the-wild voices with unique characteristics. <br>
+For ease of reproduction, source audios are 100 random utterances from LibriTTS-test-clean, and reference audios are 12 randomly picked in-the-wild voices with unique characteristics. <br>
 
 Source audios can be found under `./examples/libritts-test-clean` <br>
 Reference audios can be found under `./examples/reference` <br>
@@ -19,20 +19,22 @@ We evaluate the conversion results in terms of speaker embedding cosine similari
 our results with two strong open-sourced baselines, namely [OpenVoice](https://github.com/myshell-ai/OpenVoice) and [CosyVoice](https://github.com/FunAudioLLM/CosyVoice).
 Results in the table below show that our Seed-VC model significantly outperforms the baseline models in both intelligibility and speaker similarity.<br>
 
-| Models\Metrics | SECS↑      | WER↓       | CER↓       |
-|----------------|------------|------------|------------|
-| OpenVoice      | 0.7547     | 0.1546     | 0.0473     |
-| CosyVoice      | 0.8440     | 0.1898     | 0.0729     |
-| Seed-VC(Ours)  | **0.8676** | **0.1199** | **0.0292** |
+| Models\Metrics | SECS↑      | WER↓       | CER↓       | SIG↑     | BAK↑     | OVRL↑    |
+|----------------|------------|------------|------------|----------|----------|----------|
+| Ground Truth   | 1.0000     | 0.0802     | 0.0157     | ~        | ~        | ~        |
+| OpenVoice      | 0.7547     | 0.1546     | 0.0473     | **3.56** | **4.02** | **3.27** |
+| CosyVoice      | 0.8440     | 0.1898     | 0.0729     | 3.51     | **4.02** | 3.21     |
+| Seed-VC(Ours)  | **0.8676** | **0.1199** | **0.0292** | 3.42     | 3.97     | 3.11     |
 
 *ASR result computed by facebook/hubert-large-ls960-ft model*
 *Speaker embedding computed by resemblyzer model* <br>
 
 You can reproduce the evaluation by running the `eval.py` script.
 ```bash
-python eval.py --source ./examples/libritts-test-clean \
---reference ./examples/reference
---output ./examples/eval/converted/
+python eval.py
+--source ./examples/libritts-test-clean
+--target ./examples/reference
+--output ./examples/eval/converted
 --diffusion-steps 25
 --length-adjust 1.0
 --inference-cfg-rate 0.7
@@ -53,7 +55,7 @@ Checkpoints of the latest model release will be downloaded automatically when fi
 
 Command line inference:
 ```bash
-python inference.py --source <source-wav> \
+python inference.py --source <source-wav>
 --target <reference-wav>
 --output <output-dir>
 --diffusion-steps 25  # recommended 50~100 for singing voice conversion
````
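Similarly, the WER/CER columns come from transcribing the converted audio with the hubert-large-ls960-ft checkpoint and scoring against the ground-truth transcript with jiwer. A minimal sketch using the transformers ASR pipeline; the text normalization here is an assumption rather than the exact `eval.py` logic, and the clip path and transcript are placeholders:

```python
# Hedged sketch: WER/CER via HuBERT ASR + jiwer.
# Normalization details and file paths are assumptions, not the repo's exact logic.
import string

import jiwer
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="facebook/hubert-large-ls960-ft")

def normalize(text: str) -> str:
    # Lowercase and drop punctuation so scoring ignores casing/punctuation differences.
    return text.lower().translate(str.maketrans("", "", string.punctuation)).strip()

hypothesis = normalize(asr("converted.wav")["text"])  # placeholder converted clip
reference = normalize("the ground truth transcript of the source utterance")

print("WER:", jiwer.wer(reference, hypothesis))
print("CER:", jiwer.cer(reference, hypothesis))
```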

baselines/dnsmos/dnsmos_computor.py

Lines changed: 130 additions & 0 deletions
```python
import warnings

warnings.filterwarnings("ignore")

import librosa
import numpy as np
import onnxruntime as ort
import torch
import torchaudio

SAMPLING_RATE = 16000  # both DNSMOS models expect 16 kHz input
INPUT_LENGTH = 9.01    # seconds of audio scored per sliding window


class DNSMOSComputer:
    def __init__(
        self, primary_model_path, p808_model_path, device="cuda", device_id=0
    ) -> None:
        # Primary model predicts raw (SIG, BAK, OVRL); the P.808 model predicts a single MOS.
        self.onnx_sess = ort.InferenceSession(
            primary_model_path, providers=["CUDAExecutionProvider"]
        )
        self.p808_onnx_sess = ort.InferenceSession(
            p808_model_path, providers=["CUDAExecutionProvider"]
        )
        self.onnx_sess.set_providers(
            ["CUDAExecutionProvider"], [{"device_id": device_id}]
        )
        self.p808_onnx_sess.set_providers(
            ["CUDAExecutionProvider"], [{"device_id": device_id}]
        )
        kwargs = {
            "sample_rate": 16000,
            "hop_length": 160,
            "n_fft": 320 + 1,
            "n_mels": 120,
            "mel_scale": "slaney",
        }
        self.mel_transform = torchaudio.transforms.MelSpectrogram(**kwargs).to(
            f"cuda:{device_id}"
        )

    def audio_melspec(
        self, audio, n_mels=120, frame_size=320, hop_length=160, sr=16000, to_db=True
    ):
        # Log-mel spectrogram used as input to the P.808 model.
        mel_specgram = self.mel_transform(torch.Tensor(audio).cuda())
        mel_spec = mel_specgram.cpu()
        if to_db:
            mel_spec = (librosa.power_to_db(mel_spec, ref=np.max) + 40) / 40
        return mel_spec.T

    def get_polyfit_val(self, sig, bak, ovr, is_personalized_MOS):
        # Map raw model outputs onto the calibrated MOS scale with fitted polynomials.
        if is_personalized_MOS:
            p_ovr = np.poly1d([-0.00533021, 0.005101, 1.18058466, -0.11236046])
            p_sig = np.poly1d([-0.01019296, 0.02751166, 1.19576786, -0.24348726])
            p_bak = np.poly1d([-0.04976499, 0.44276479, -0.1644611, 0.96883132])
        else:
            p_ovr = np.poly1d([-0.06766283, 1.11546468, 0.04602535])
            p_sig = np.poly1d([-0.08397278, 1.22083953, 0.0052439])
            p_bak = np.poly1d([-0.13166888, 1.60915514, -0.39604546])
        return p_sig(sig), p_bak(bak), p_ovr(ovr)

    def compute(self, audio, sampling_rate, is_personalized_MOS=False):
        fs = SAMPLING_RATE
        if isinstance(audio, str):
            audio, _ = librosa.load(audio, sr=fs)
        elif sampling_rate != fs:
            audio = librosa.resample(audio, orig_sr=sampling_rate, target_sr=fs)
        actual_audio_len = len(audio)
        len_samples = int(INPUT_LENGTH * fs)
        # Tile short clips until they cover at least one full input window.
        while len(audio) < len_samples:
            audio = np.append(audio, audio)
        # Slide a 9.01-second window over the clip with a 1-second hop.
        num_hops = int(np.floor(len(audio) / fs) - INPUT_LENGTH) + 1
        hop_len_samples = fs
        predicted_mos_sig_seg_raw = []
        predicted_mos_bak_seg_raw = []
        predicted_mos_ovr_seg_raw = []
        predicted_mos_sig_seg = []
        predicted_mos_bak_seg = []
        predicted_mos_ovr_seg = []
        predicted_p808_mos = []

        for idx in range(num_hops):
            audio_seg = audio[
                int(idx * hop_len_samples) : int((idx + INPUT_LENGTH) * hop_len_samples)
            ]
            if len(audio_seg) < len_samples:
                continue
            input_features = np.array(audio_seg).astype("float32")[np.newaxis, :]
            # The P.808 model takes a mel spectrogram; the last hop is dropped to match its input shape.
            p808_input_features = np.array(
                self.audio_melspec(audio=audio_seg[:-160])
            ).astype("float32")[np.newaxis, :, :]
            oi = {"input_1": input_features}
            p808_oi = {"input_1": p808_input_features}
            p808_mos = self.p808_onnx_sess.run(None, p808_oi)[0][0][0]
            mos_sig_raw, mos_bak_raw, mos_ovr_raw = self.onnx_sess.run(None, oi)[0][0]
            mos_sig, mos_bak, mos_ovr = self.get_polyfit_val(
                mos_sig_raw, mos_bak_raw, mos_ovr_raw, is_personalized_MOS
            )
            predicted_mos_sig_seg_raw.append(mos_sig_raw)
            predicted_mos_bak_seg_raw.append(mos_bak_raw)
            predicted_mos_ovr_seg_raw.append(mos_ovr_raw)
            predicted_mos_sig_seg.append(mos_sig)
            predicted_mos_bak_seg.append(mos_bak)
            predicted_mos_ovr_seg.append(mos_ovr)
            predicted_p808_mos.append(p808_mos)

        # Average per-window predictions over the whole clip.
        clip_dict = {
            "filename": "audio_clip",
            "len_in_sec": actual_audio_len / fs,
            "sr": fs,
            "num_hops": num_hops,
            "OVRL_raw": np.mean(predicted_mos_ovr_seg_raw),
            "SIG_raw": np.mean(predicted_mos_sig_seg_raw),
            "BAK_raw": np.mean(predicted_mos_bak_seg_raw),
            "OVRL": np.mean(predicted_mos_ovr_seg),
            "SIG": np.mean(predicted_mos_sig_seg),
            "BAK": np.mean(predicted_mos_bak_seg),
            "P808_MOS": np.mean(predicted_p808_mos),
        }
        return clip_dict
```
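A minimal usage sketch for the class above, assuming the two ONNX checkpoints added in this commit and a CUDA device; the wav path is a placeholder:

```python
# Hedged sketch: score one clip with the DNSMOS models from this commit.
from baselines.dnsmos.dnsmos_computor import DNSMOSComputer, SAMPLING_RATE

computer = DNSMOSComputer(
    "baselines/dnsmos/sig_bak_ovr.onnx",  # primary model -> SIG/BAK/OVRL
    "baselines/dnsmos/model_v8.onnx",     # P.808 model -> single MOS
    device="cuda",
    device_id=0,
)
# compute() accepts a file path (loaded at 16 kHz) or a float array plus its sample rate.
scores = computer.compute("some_clip.wav", SAMPLING_RATE)  # placeholder path
print(scores["SIG"], scores["BAK"], scores["OVRL"], scores["P808_MOS"])
```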

baselines/dnsmos/model_v8.onnx

220 KB
Binary file not shown.

baselines/dnsmos/sig_bak_ovr.onnx

1.1 MB
Binary file not shown.

eval.py

Lines changed: 36 additions & 0 deletions
```diff
@@ -31,6 +31,29 @@
 import jiwer
 import string
 
+from baselines.dnsmos.dnsmos_computor import DNSMOSComputer
+
+def calc_mos(computor, audio, orin_sr):
+    # only 16k audio is supported
+    target_sr = 16000
+    if orin_sr != 16000:
+        audio = librosa.resample(
+            audio, orig_sr=orin_sr, target_sr=target_sr, res_type="kaiser_fast"
+        )
+    result = computor.compute(audio, target_sr, False)
+    sig, bak, ovr = result["SIG"], result["BAK"], result["OVRL"]
+
+    if ovr == 0:
+        print("calculate dns mos failed")
+    return sig, bak, ovr
+
+mos_computer = DNSMOSComputer(
+    "baselines/dnsmos/sig_bak_ovr.onnx",
+    "baselines/dnsmos/model_v8.onnx",
+    device="cuda",
+    device_id=0,
+)
+
 def load_models(args):
     dit_checkpoint_path, dit_config_path = load_custom_model_from_hf("Plachta/Seed-VC",
                                                                      "DiT_seed_v2_uvit_whisper_small_wavenet_bigvgan_pruned.pth",
```
```diff
@@ -175,6 +198,7 @@ def main(args):
     gt_cer_list = []
     vc_wer_list = []
     vc_cer_list = []
+    dnsmos_list = []
    for source_i, source_line in enumerate(tqdm(source_audio_list)):
        if source_i >= max_samples:
            break
@@ -287,12 +311,20 @@
             vc_wer_list.append(vc_wer)
             vc_cer_list.append(vc_cer)
 
+            # calculate dnsmos
+            sig, bak, ovr = calc_mos(mos_computer, vc_wave_16k.squeeze(0).cpu().numpy(), 16000)
+            dnsmos_list.append((sig, bak, ovr))
+
         print(f"Average GT WER: {sum(gt_wer_list) / len(gt_wer_list)}")
         print(f"Average GT CER: {sum(gt_cer_list) / len(gt_cer_list)}")
         print(f"Average VC WER: {sum(vc_wer_list) / len(vc_wer_list)}")
         print(f"Average VC CER: {sum(vc_cer_list) / len(vc_cer_list)}")
         print(f"Average similarity: {sum(similarity_list) / len(similarity_list)}")
 
+        print(f"Average DNS MOS SIG: {sum([x[0] for x in dnsmos_list]) / len(dnsmos_list)}")
+        print(f"Average DNS MOS BAK: {sum([x[1] for x in dnsmos_list]) / len(dnsmos_list)}")
+        print(f"Average DNS MOS OVR: {sum([x[2] for x in dnsmos_list]) / len(dnsmos_list)}")
+
         # save wer and cer result into this directory as a txt
         with open(osp.join(conversion_result_dir, source_index, "result.txt"), 'w') as f:
             f.write(f"GT WER: {sum(gt_wer_list[-len(target_audio_list):]) / len(target_audio_list)}\n")
@@ -316,6 +348,10 @@
         f.write(f"VC WER: {sum(vc_wer_list) / len(vc_wer_list)}\n")
         f.write(f"VC CER: {sum(vc_cer_list) / len(vc_cer_list)}\n")
 
+    print(f"Average DNS MOS SIG: {sum([x[0] for x in dnsmos_list]) / len(dnsmos_list)}")
+    print(f"Average DNS MOS BAK: {sum([x[1] for x in dnsmos_list]) / len(dnsmos_list)}")
+    print(f"Average DNS MOS OVR: {sum([x[2] for x in dnsmos_list]) / len(dnsmos_list)}")
+
 
 def convert(
     source_path,
```

inference.py

Lines changed: 0 additions & 2 deletions
```diff
@@ -253,8 +253,6 @@ def main(args):
     # Convert to waveform
     # if f0_condition:
     vc_wave = bigvgan_model(vc_target).squeeze(1)  # wav_gen is FloatTensor with shape [B(1), 1, T_time] and values in [-1, 1]
-    # else:
-    #     vc_wave = hift_gen.inference(vc_target, f0=None)
 
     time_vc_end = time.time()
     print(f"RTF: {(time_vc_end - time_vc_start) / vc_wave.size(-1) * sr}")
```
