Open
Labels
bug (Something isn't working)
Description
Describe the bug
convert_hf_dataset_to_nemo.py (used in the ASR CTC Language Finetuning tutorial) fails with Common Voice 17.0 because trust_remote_code is no longer supported in the Hugging Face datasets library. The dataset loader script (common_voice_17_0.py) cannot be used anymore.
Steps/Code to reproduce bug
!python convert_hf_dataset_to_nemo.py \
output_dir=datasets/en \
path=mozilla-foundation/common_voice_17_0 \
name=en \
split=train \
use_auth_token=True
Error
[2025-08-28 05:01:24,467][datasets.load][ERROR] - `trust_remote_code` is not supported anymore.
Please check that the Hugging Face dataset 'mozilla-foundation/common_voice_17_0' isn't based on a loading script and remove `trust_remote_code`.
If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
HuggingFace datasets failed due to some reason (stack trace below).
For certain datasets (eg: MCV), it may be necessary to login to the huggingface-cli (via `huggingface-cli login`).
Once logged in, you need to set `use_auth_token=True` when calling this script.
Traceback error for reference :
Traceback (most recent call last):
  File "/content/convert_hf_dataset_to_nemo.py", line 361, in main
    dataset = load_dataset(
              ^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/datasets/load.py", line 1392, in load_dataset
    builder_instance = load_dataset_builder(
                       ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/datasets/load.py", line 1132, in load_dataset_builder
    dataset_module = dataset_module_factory(
                     ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/datasets/load.py", line 1031, in dataset_module_factory
    raise e1 from None
  File "/usr/local/lib/python3.12/dist-packages/datasets/load.py", line 989, in dataset_module_factory
    raise RuntimeError(f"Dataset scripts are no longer supported, but found {filename}")
RuntimeError: Dataset scripts are no longer supported, but found common_voice_17_0.py
Expected behavior
The script should load the Parquet-backed version of Common Voice 17.0 and generate NeMo-compatible manifests without requiring deprecated loader scripts or trust_remote_code.
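For reference, the manifest-writing half of that expectation could look roughly like the sketch below. It assumes the usual NeMo ASR manifest layout (one JSON object per line with audio_filepath, duration and text). The helper name write_nemo_manifest and the shape of the input rows are hypothetical and are not taken from convert_hf_dataset_to_nemo.py.

```python
import json
from pathlib import Path
from typing import Any, Dict, Iterable


def write_nemo_manifest(rows: Iterable[Dict[str, Any]], manifest_path: str) -> int:
    """Write one JSON object per line and return the number of entries written.

    Hypothetical helper: each row is assumed to already carry the path of an
    audio file on disk, its duration in seconds, and its transcript.
    """
    out = Path(manifest_path)
    out.parent.mkdir(parents=True, exist_ok=True)
    count = 0
    with out.open("w", encoding="utf-8") as f:
        for row in rows:
            entry = {
                "audio_filepath": str(row["audio_filepath"]),
                "duration": float(row["duration"]),
                "text": row["text"],
            }
            # ensure_ascii=False keeps non-English transcripts readable.
            f.write(json.dumps(entry, ensure_ascii=False) + "\n")
            count += 1
    return count
```

Rows would come from whatever Parquet-backed loading path ends up being used. The sketch only covers serialization, not audio extraction or resampling.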
Environment overview (please complete the following information)
- Environment location: Google Colab
- Method of NeMo install:
python -m pip install git+https://github.com/NVIDIA/NeMo.git@main#egg=nemo_toolkit[all]
Environment details
- OS version: Ubuntu 22.04.4 LTS
- PyTorch version: 2.8.0+cu126
- Python version: 3.12.11
Additional context
- trust_remote_code can be removed from the script.
- The ASR tutorial can be updated to reflect this change.
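A minimal sketch of what dropping the argument might look like, assuming the script collects its options and forwards them to datasets.load_dataset. The helper names build_load_kwargs and load_hf_dataset are hypothetical, and passing token in place of the older use_auth_token is an assumption about how the script would be updated. Note that the traceback above suggests this alone is not enough for mozilla-foundation/common_voice_17_0, since the repository still contains common_voice_17_0.py; the data would also need to be available in a standard format such as Parquet.

```python
from typing import Any, Dict, Optional


def build_load_kwargs(
    path: str,
    name: Optional[str],
    split: Optional[str],
    use_auth_token: bool,
) -> Dict[str, Any]:
    """Collect keyword arguments for load_dataset without trust_remote_code."""
    kwargs: Dict[str, Any] = {"path": path, "split": split}
    if name is not None:
        kwargs["name"] = name
    if use_auth_token:
        # token=True asks the library to reuse the locally stored login token.
        kwargs["token"] = True
    return kwargs


def load_hf_dataset(path: str, name: Optional[str], split: Optional[str], use_auth_token: bool):
    """Hypothetical wrapper; imports datasets lazily so the helper above stays testable offline."""
    from datasets import load_dataset

    return load_dataset(**build_load_kwargs(path, name, split, use_auth_token))
```

With the arguments from the reproduction command, build_load_kwargs returns path, name, split and token only, so no trust_remote_code key reaches load_dataset.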