Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] refactor of dataset builder and executor #537

Open
wants to merge 40 commits into
base: main
Choose a base branch
from

Conversation

cyruszhang
Copy link
Collaborator

Key elements of this PR:

  1. YAML显式定义dataset不同来源;local和remote分开定义
  2. 更加灵活开放的数据集参数化控制;根据不同来源,支持不同参数和相关验证;并留出口子支持更多追加/细节配置
  3. 解绑Executor的hardcode支持(目前RayExecutor只接受local json格式,并在代码层面hardcode绑定);Executor/RayExecutor不绑定dataset输入格式,但是根据formatter/downloader对于executor类型的支持来判断是否可加载
  4. 提高Executor框架扩展性,以更方便支持Nemo、Dask、Spark等其他引擎
  5. 支持数据格式验证
  6. 额外的数据来源支持
    a. 支持modelscope
    b. 支持arxiv,下载、解压、引入
    c. 支持wiki,下载、解压、引入
    d. 支持commoncrawl,下载、解压、引入
  7. 兼容命令行目前的dataset_path格式
  8. 兼容数据混搭,data mixture
  9. 兼容empty_formatter/generated_dataset_config通路

design doc: https://aliyuque.antfin.com/yilei.z/cnk4dn/qomvqql62lyglrh2?singleDoc# 《Dataset/Loader/Executor的重构设计》

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant