write a load_example method #63

Open
alex-hh opened this issue Nov 10, 2024 · 2 comments
Comments

@alex-hh
Collaborator

alex-hh commented Nov 10, 2024

This assumes the dataset has an id field and an index.

@alex-hh
Collaborator Author

alex-hh commented Nov 10, 2024

The index will be a parquet file (stored without a file extension) that maps each id to its shard. We can then download just that one shard and retrieve the example from it.
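A minimal sketch of the lookup described above, assuming the index has already been read into an id-to-shard mapping. The `load_example` signature, the `fetch_shard` callable, and the shard names are hypothetical; reading the parquet index and downloading shards are deliberately left behind those placeholders.

```python
from typing import Any, Callable, Dict, List, Mapping

Example = Dict[str, Any]


def load_example(
    example_id: str,
    index: Mapping[str, str],
    fetch_shard: Callable[[str], List[Example]],
    id_field: str = "id",
) -> Example:
    """Return the example with the given id, fetching only the shard that holds it."""
    if example_id not in index:
        raise KeyError(f"id {example_id!r} not found in index")
    shard_name = index[example_id]
    # Only this one shard is fetched; all other shards are never touched.
    for example in fetch_shard(shard_name):
        if example[id_field] == example_id:
            return example
    raise LookupError(
        f"index maps {example_id!r} to {shard_name!r}, but the shard does not contain it"
    )


# Toy in-memory stand-ins for remote shards and the index (hypothetical data).
SHARDS = {
    "shard-0.parquet": [{"id": "a", "text": "first"}, {"id": "b", "text": "second"}],
    "shard-1.parquet": [{"id": "c", "text": "third"}],
}
INDEX = {"a": "shard-0.parquet", "b": "shard-0.parquet", "c": "shard-1.parquet"}
fetched: List[str] = []


def fetch_shard(name: str) -> List[Example]:
    fetched.append(name)  # record which shards were actually requested
    return SHARDS[name]


example = load_example("c", INDEX, fetch_shard)
```

In a real implementation `fetch_shard` would download and read a parquet shard; it is passed in as a callable here so the lookup logic stays independent of storage.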

@alex-hh
Collaborator Author

alex-hh commented Nov 10, 2024

What we need:

a split generator that looks for config- and split-specific index files (train_index or train/index)
the index files let us subset both the parquet shards and the examples within them
we then apply ds.filter before returning the dataset
there might be an efficient Arrow-native way to implement the filter

(The subset could also be specified directly in the YAML, but the index-file solution is more modular.)
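The steps above could be sketched roughly as follows. All function names are hypothetical, the placement of the config name as a path prefix is an assumption, and the index is again treated as an already-loaded id-to-shard mapping. The last helper is a plain-Python stand-in for the `ds.filter` step.

```python
from typing import Any, Dict, Iterable, List, Optional, Set, Tuple


def find_index_file(
    files: Iterable[str], split: str, config: Optional[str] = None
) -> Optional[str]:
    """Return the first index path present for this config and split, else None."""
    prefix = f"{config}/" if config else ""  # assumption: config is a directory prefix
    candidates = [f"{prefix}{split}_index", f"{prefix}{split}/index"]
    available = set(files)
    for candidate in candidates:
        if candidate in available:
            return candidate
    return None


def subset_from_index(
    index: Dict[str, str], shard_files: Iterable[str]
) -> Tuple[List[str], Set[str]]:
    """Return the shards that need downloading and the ids to keep."""
    wanted_shards = set(index.values())
    shards = [f for f in shard_files if f in wanted_shards]
    return shards, set(index)


def filter_examples(
    examples: Iterable[Dict[str, Any]], keep_ids: Set[str], id_field: str = "id"
) -> List[Dict[str, Any]]:
    """Keep only examples whose id appears in the index."""
    return [ex for ex in examples if ex[id_field] in keep_ids]


# Hypothetical repository layout and index contents.
files = [
    "default/train_index",
    "default/train/shard-0.parquet",
    "default/train/shard-1.parquet",
    "default/train/shard-2.parquet",
]
index_path = find_index_file(files, split="train", config="default")
index = {
    "a": "default/train/shard-0.parquet",
    "c": "default/train/shard-2.parquet",
}
shards, keep_ids = subset_from_index(index, files)
kept = filter_examples([{"id": "a"}, {"id": "b"}], keep_ids)
```

With a `datasets.Dataset`, the final step would correspond to something like `ds.filter(lambda ex: ex["id"] in keep_ids)`. For the Arrow route, one option might be to build a boolean mask with `pyarrow.compute.is_in` and pass it to `Table.filter`, though whether that is faster here is untested.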
