
Inference #28

Open · voicegen opened this issue Jul 10, 2023 · 7 comments

@voicegen

Hello, during the inference phase, do I only need to use the 886 audio files from your data/test_audiocaps_subset.json? I have been unable to obtain the results from your paper, even when using your checkpoint.

@deepanwayx (Collaborator)

Yes, we used those 886 audio files for evaluation. Can you specify which checkpoint you used and which results you were not able to obtain?
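(For reference, a minimal sketch of loading the test subset and checking the instance count; the JSON-lines layout and the `captions` field are assumptions based on the `text_key` in the evaluation args shown later in this thread.)

```python
import json

# Load data/test_audiocaps_subset.json. Assumption: either a single
# JSON array of records, or one JSON object per line (JSON-lines).
path = "data/test_audiocaps_subset.json"
with open(path) as f:
    text = f.read().strip()

try:
    records = json.loads(text)  # whole file is one JSON array
except json.JSONDecodeError:
    records = [json.loads(line) for line in text.splitlines() if line.strip()]

print(len(records))            # expected: 886 test instances
print(records[0]["captions"])  # the caption used as the text prompt
```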

@965694547

> Yes, we used those 886 audio files for evaluation. Can you specify which checkpoint you used and which results you were not able to obtain?

I used https://huggingface.co/declare-lab/tango to generate the 886 audio files with Guidance Scale = 3 and Steps = 200, and got:

{
    "frechet_distance": 28.07995041974766,
    "frechet_audio_distance": 2.2381015516014955,
    "kullback_leibler_divergence_sigmoid": 3.8415958881378174,
    "kullback_leibler_divergence_softmax": 2.097446918487549,
    "lsd": 2.0631229603209094,
    "psnr": 15.874651663776682,
    "ssim": 0.4171875863485156,
    "ssim_stft": 0.09866382013407798,
    "inception_score_mean": 7.612150196882789,
    "inception_score_std": 0.8235111705490618,
    "kernel_inception_distance_mean": 0.010067609062191894,
    "kernel_inception_distance_std": 1.404596756557554e-07
}
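(For context, a minimal sketch of how such a generation run might look with the Hugging Face checkpoint; the `Tango` wrapper class and the `steps`/`guidance` keyword arguments are assumptions based on the repository's README.)

```python
import json
import soundfile as sf
from tango import Tango  # wrapper class from the declare-lab/tango repo

# Assumption: Tango(...) pulls the declare-lab/tango checkpoint from
# Hugging Face, and generate() accepts steps/guidance keyword arguments.
tango = Tango("declare-lab/tango")

with open("data/test_audiocaps_subset.json") as f:
    records = [json.loads(line) for line in f if line.strip()]

for i, rec in enumerate(records):  # the 886 test captions
    audio = tango.generate(rec["captions"], steps=200, guidance=3)
    sf.write(f"outputs/{i}.wav", audio, samplerate=16000)
```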

@965694547

Do I need to control the length of the generated audio to match the original audio length in order for the metrics to be comparable?

@deepanwayx (Collaborator)

No, the length doesn't have to be controlled.

I added the inference_hf.py script for running evaluation from our Hugging Face checkpoints. Can you try it and check the scores you obtain from this script?

I just did two runs and got the following scores:

{
    "frechet_distance": 24.4243,
    "frechet_audio_distance": 1.7324,
    "kl_sigmoid": 3.5901,
    "kl_softmax": 1.3216,
    "lsd": 2.0861,
    "psnr": 15.6047,
    "ssim": 0.4061,
    "ssim_stft": 0.1027,
    "is_mean": 7.5181,
    "is_std": 0.6758,
    "kid_mean": 0.0066,
    "kid_std": 0.0,
    "Steps": 200,
    "Guidance Scale": 3,
    "Test Instances": 886,
    "scheduler_config": {
        "num_train_timesteps": 1000,
        "beta_start": 0.00085,
        "beta_end": 0.012,
        "beta_schedule": "scaled_linear",
        "trained_betas": null,
        "variance_type": "fixed_small",
        "clip_sample": false,
        "prediction_type": "v_prediction",
        "thresholding": false,
        "dynamic_thresholding_ratio": 0.995,
        "clip_sample_range": 1.0,
        "sample_max_value": 1.0,
        "_class_name": "DDIMScheduler",
        "_diffusers_version": "0.8.0",
        "set_alpha_to_one": false,
        "skip_prk_steps": true,
        "steps_offset": 1
    },
    "args": {
        "test_file": "data/test_audiocaps_subset.json",
        "text_key": "captions",
        "device": "cuda:0",
        "test_references": "data/audiocaps_test_references/subset",
        "num_steps": 200,
        "guidance": 3,
        "batch_size": 8,
        "num_test_instances": -1
    },
    "output_dir": "outputs/1688974057_steps_200_guidance_3"
}
{
    "frechet_distance": 24.9405,
    "frechet_audio_distance": 1.6633,
    "kl_sigmoid": 3.551,
    "kl_softmax": 1.3122,
    "lsd": 2.0957,
    "psnr": 15.5877,
    "ssim": 0.405,
    "ssim_stft": 0.1027,
    "is_mean": 7.187,
    "is_std": 0.5192,
    "kid_mean": 0.0066,
    "kid_std": 0.0,
    "Steps": 200,
    "Guidance Scale": 3,
    "Test Instances": 886,
    "scheduler_config": {
        "num_train_timesteps": 1000,
        "beta_start": 0.00085,
        "beta_end": 0.012,
        "beta_schedule": "scaled_linear",
        "trained_betas": null,
        "variance_type": "fixed_small",
        "clip_sample": false,
        "prediction_type": "v_prediction",
        "thresholding": false,
        "dynamic_thresholding_ratio": 0.995,
        "clip_sample_range": 1.0,
        "sample_max_value": 1.0,
        "_class_name": "DDIMScheduler",
        "_diffusers_version": "0.8.0",
        "set_alpha_to_one": false,
        "skip_prk_steps": true,
        "steps_offset": 1
    },
    "args": {
        "test_file": "data/test_audiocaps_subset.json",
        "text_key": "captions",
        "device": "cuda:3",
        "test_references": "data/audiocaps_test_references/subset",
        "num_steps": 200,
        "guidance": 3,
        "batch_size": 8,
        "num_test_instances": -1
    },
    "output_dir": "outputs/1688974524_steps_200_guidance_3"
}

Our results in the paper are the average of multiple runs, as there is some randomness in the diffusion inference process.
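(A minimal sketch of averaging the numeric fields of the per-run summaries shown above; the summary file paths are hypothetical, since the exact layout written by inference_hf.py is not shown here.)

```python
import json
from collections import defaultdict

# Hypothetical paths: wherever inference_hf.py writes each run's summary JSON.
run_files = [
    "outputs/1688974057_steps_200_guidance_3/summary.json",
    "outputs/1688974524_steps_200_guidance_3/summary.json",
]

totals = defaultdict(float)
for path in run_files:
    with open(path) as f:
        metrics = json.load(f)
    for key, value in metrics.items():
        if isinstance(value, (int, float)):  # skip nested config/args dicts
            totals[key] += value

averaged = {key: total / len(run_files) for key, total in totals.items()}
print(json.dumps(averaged, indent=4))
```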

@965694547

Thank you for the explanation.

@965694547

I found that the sampling rate of the reference audio has an impact on the final result. I would like to ask: what was the sampling rate of your reference audio before converting to 16 kHz?

@deepanwayx (Collaborator)

All our reference audio files are at 16 kHz.

I checked the AudioLDM Eval repository, and they now mention that the sampling rate can have an effect on the evaluation scores.

Their paper and evaluation code indicate that their scores are reported at 16 kHz, so we also report results at the same sampling rate for a fair comparison.
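(For anyone whose reference audio is at a different rate, a minimal sketch of resampling to 16 kHz before evaluation, using librosa and soundfile; the source directory is hypothetical.)

```python
import os
import librosa
import soundfile as sf

src_dir = "data/audiocaps_test_references/original"  # hypothetical source dir
dst_dir = "data/audiocaps_test_references/subset"    # 16 kHz references

os.makedirs(dst_dir, exist_ok=True)
for name in os.listdir(src_dir):
    if not name.endswith(".wav"):
        continue
    # librosa.load resamples to the requested rate on load
    audio, _ = librosa.load(os.path.join(src_dir, name), sr=16000, mono=True)
    sf.write(os.path.join(dst_dir, name), audio, 16000)
```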
