Update zoengjyutgaai_saamgwokjinji.py #1

Open: wants to merge 2 commits into main

Conversation

@indiejoseph

Since the dataset on Hugging Face was updated, the "train" split has been removed and replaced by "saamgwokjinji" and "seoiwuzyun".

@AlienKevin
Owner

Thanks! Can you add the mouzaakdung split as well?

@indiejoseph
Author

Done. However, when I tried running it, the dataset contains many more audio samples than before and the run takes too long. I think it is unnecessary to go through all samples.

@AlienKevin
Owner

Sure, maybe we can select a random subset from each dataset?
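
One way this could look is sketched below: a minimal, standard-library-only example that draws a reproducible random subset of row indices for each split. The split names come from this thread; the helper names (`sample_indices`, `sample_all`), the seed, the subset size, and the split sizes are assumptions for illustration, not code from this PR.

```python
import random

# Split names mentioned in this thread.
SPLITS = ("saamgwokjinji", "seoiwuzyun", "mouzaakdung")


def sample_indices(num_rows: int, k: int, seed: int = 42) -> list[int]:
    """Return up to k distinct row indices, reproducible for a given seed."""
    k = min(k, num_rows)  # small splits are used in full
    rng = random.Random(seed)  # local RNG, so the global random state is untouched
    return sorted(rng.sample(range(num_rows), k))


def sample_all(split_sizes: dict[str, int], k: int, seed: int = 42) -> dict[str, list[int]]:
    """Pick a subset of row indices for every split."""
    return {name: sample_indices(n, k, seed) for name, n in split_sizes.items()}


if __name__ == "__main__":
    # Hypothetical sizes; the real splits would report their own lengths.
    sizes = {name: 5000 for name in SPLITS}
    subsets = sample_all(sizes, k=1000)
    print({name: len(idx) for name, idx in subsets.items()})
```

With the Hugging Face `datasets` library, index lists like these could be passed to `Dataset.select` to materialize the subset. Note that the indices are only stable as long as the row order and split sizes do not change.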

@indiejoseph
Author

The dataset keeps being updated, and it is hard to benchmark when things keep changing. Could @laubonghaudoi create a test split with 1k samples in the repo? Otherwise, we could fork one.
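
Until such a split exists, one possible stopgap is to freeze the chosen sample IDs in a small manifest file, so that later dataset updates do not change the benchmark set. The sketch below assumes each sample has a stable string ID, which this thread does not confirm; the function names and the manifest path are hypothetical.

```python
import hashlib
import json
from pathlib import Path


def pick_test_ids(ids, k=1000):
    """Choose k IDs by ranking on a hash of the ID, not on row position."""
    ranked = sorted(
        set(ids),
        key=lambda i: hashlib.sha256(i.encode("utf-8")).hexdigest(),
    )
    return ranked[:k]


def load_or_create_manifest(path, ids, k=1000):
    """Reuse the frozen ID list if it exists; otherwise create and save it."""
    path = Path(path)
    if path.exists():
        # The manifest wins, even if the dataset has grown since it was written.
        return json.loads(path.read_text(encoding="utf-8"))
    chosen = pick_test_ids(ids, k)
    path.write_text(json.dumps(chosen, ensure_ascii=False, indent=2), encoding="utf-8")
    return chosen
```

Committing the manifest next to the benchmark script would let every run evaluate the same samples. IDs that disappear from a later dataset version would still need to be handled by the caller.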

@laubonghaudoi

How long would it take to run through everything once?

@indiejoseph
Author

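On an H100, SenseVoice with batch size 64 is fast, about 45 minutes, but Whisper v3 is very slow. I use a custom processor, so I did not run it in batches through the pipeline, which means my timing is not representative. I will run it once with the pipeline later and report back.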

@laubonghaudoi

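Thanks a lot. I did not realize it takes that long even on an H100. Could this PR also update the README while you are at it? It currently still says: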

[Storytelling] Zoeng Jyutgaai's Romance of the Three Kingdomgs: 1,402 utterances by Cantonese storyteller Zoeng Jyutgaai (張悦楷) on the classical work "Romance of the three kingdoms".

The dataset has been renamed to CanCLID/zoengjyutgaai, and it contains more utterances than that.

3 participants