ASR tutorial / convert_hf_dataset_to_nemo.py broken with Common Voice 17.0 #14596

@zefang-liu

Describe the bug

convert_hf_dataset_to_nemo.py (used in the ASR CTC Language Finetuning tutorial) fails with Common Voice 17.0 because the Hugging Face datasets library no longer supports trust_remote_code or dataset loading scripts. As a result, the dataset's loader script (common_voice_17_0.py) can no longer be used.

Steps/Code to reproduce bug

!python convert_hf_dataset_to_nemo.py \
    output_dir=datasets/en \
    path=mozilla-foundation/common_voice_17_0 \
    name=en \
    split=train \
    use_auth_token=True

Error

[2025-08-28 05:01:24,467][datasets.load][ERROR] - `trust_remote_code` is not supported anymore.
Please check that the Hugging Face dataset 'mozilla-foundation/common_voice_17_0' isn't based on a loading script and remove `trust_remote_code`.
If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
HuggingFace datasets failed due to some reason (stack trace below). 
For certain datasets (eg: MCV), it may be necessary to login to the huggingface-cli (via `huggingface-cli login`).
Once logged in, you need to set `use_auth_token=True` when calling this script.

Traceback error for reference :

Traceback (most recent call last):
  File "/content/convert_hf_dataset_to_nemo.py", line 361, in main
    dataset = load_dataset(
              ^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/datasets/load.py", line 1392, in load_dataset
    builder_instance = load_dataset_builder(
                       ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/datasets/load.py", line 1132, in load_dataset_builder
    dataset_module = dataset_module_factory(
                     ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/datasets/load.py", line 1031, in dataset_module_factory
    raise e1 from None
  File "/usr/local/lib/python3.12/dist-packages/datasets/load.py", line 989, in dataset_module_factory
    raise RuntimeError(f"Dataset scripts are no longer supported, but found {filename}")
RuntimeError: Dataset scripts are no longer supported, but found common_voice_17_0.py
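
Note that the script's generic hint about `huggingface-cli login` is printed even though the real cause here is the loading script. A minimal sketch of how the two cases could be told apart, keyed on the RuntimeError text in the traceback above; the helper names `is_script_dataset_error` and `explain_load_failure` are hypothetical and not part of NeMo or datasets:

```python
# Hypothetical helpers, not part of NeMo or the datasets library.
# The marker string is copied from the RuntimeError in the traceback above.
SCRIPT_ERROR_MARKER = "Dataset scripts are no longer supported"


def is_script_dataset_error(exc: BaseException) -> bool:
    """Return True if exc looks like the 'dataset scripts' rejection."""
    return isinstance(exc, RuntimeError) and SCRIPT_ERROR_MARKER in str(exc)


def explain_load_failure(exc: BaseException) -> str:
    """Pick a hint that matches the failure instead of always suggesting login."""
    if is_script_dataset_error(exc):
        return (
            "The dataset repository relies on a loading script, which the "
            "installed datasets release rejects; logging in will not help."
        )
    return (
        "HuggingFace datasets failed; for gated datasets it may be necessary "
        "to run `huggingface-cli login` and set use_auth_token=True."
    )
```

Matching on the message text is fragile if the wording changes between datasets releases, so this is only meant to illustrate the idea.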

Expected behavior

The script should load the Parquet-backed version of Common Voice 17.0 and generate NeMo-compatible manifests without requiring deprecated loader scripts or trust_remote_code.

Environment overview (please complete the following information)

  • Environment location: Google Colab
  • Method of NeMo install: python -m pip install git+https://github.com/NVIDIA/NeMo.git@main#egg=nemo_toolkit[all]

Environment details

  • OS version: Ubuntu 22.04.4 LTS
  • PyTorch version: 2.8.0+cu126
  • Python version: 3.12.11

Additional context

  • trust_remote_code can be removed from the script.
  • The ASR tutorial can be updated to reflect this change.
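
A rough sketch of what dropping trust_remote_code from the load call might look like. The helper `build_load_kwargs` is hypothetical and is not the actual code in convert_hf_dataset_to_nemo.py. Passing the credential as `token` is an assumption about the installed datasets release; older releases used `use_auth_token`.

```python
from typing import Any, Dict, Optional


def build_load_kwargs(
    path: str,
    name: Optional[str] = None,
    split: Optional[str] = None,
    use_auth_token: bool = False,
    streaming: bool = False,
) -> Dict[str, Any]:
    """Collect keyword arguments for datasets.load_dataset.

    trust_remote_code is deliberately never added, since the library
    reports it as unsupported.
    """
    kwargs: Dict[str, Any] = {"path": path, "streaming": streaming}
    if name is not None:
        kwargs["name"] = name
    if split is not None:
        kwargs["split"] = split
    if use_auth_token:
        # Assumption: the installed datasets release accepts `token`.
        kwargs["token"] = True
    return kwargs


def load_without_remote_code(**cli_args: Any):
    """Call load_dataset with the filtered arguments (needs network access)."""
    from datasets import load_dataset  # imported lazily so the helper above has no dependency

    return load_dataset(**build_load_kwargs(**cli_args))
```

With the arguments from the reproduction command this would be `load_without_remote_code(path="mozilla-foundation/common_voice_17_0", name="en", split="train", use_auth_token=True)`. Removing the argument alone only silences the first error line; judging by the RuntimeError above, the load would presumably still fail as long as the repository is resolved through common_voice_17_0.py rather than Parquet files.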

Labels

bug: Something isn't working
