Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor data modules #225

Open
huzecong opened this issue Oct 7, 2019 · 0 comments
Open

Refactor data modules #225

huzecong opened this issue Oct 7, 2019 · 0 comments
Assignees
Labels
bug Something isn't working enhancement New feature or request topic: data Issue about data loader modules

Comments

@huzecong
Copy link
Collaborator

huzecong commented Oct 7, 2019

There are several problems with the current data modules:

  1. Filtering has to happen at data source level. This means one can't discard examples in process, but instead must supply a filter_fn and wrap the source with a FilterDataSource.

    This is problematic in two ways: 1) users don't have access to the source in predefined sources such as MonoTextData; 2) filtering cannot depend on processing results. The second point led to a previous change that made TextLineDataSource return tokenized text, because it also needs to filter out sentences containing more than a certain number of words. This prevented users from applying it to arbitrary text, such as JSONL files.

    However, it shouldn't be hard to support discarding examples in process. For the interface, we can ask the user to return None to indicate discarding. The implementation should be straightforward --- just change the BatchSampler (and maybe DynamicBatchSampler) code.

  2. There's a very subtle bug that I just realized. When a dataset has the following configurations:

    • Source does not support random access (e.g., TextLine..., Iter...).
    • Lazy strategy is "process", i.e., data is eagerly loaded but lazily processed.
    • Caching strategy is not "loaded", i.e., either don't cache at all, or cache the processed examples.
    • A dynamic batching strategy is used (e.g., TokenCountBatchingStrategy).

    What happens under the hood is:

    • The source is wrapped in a _CachedDataSource to support random access by prefetching.
    • All raw examples are read from source when the dataset is constructed.
    • Since loaded raw examples are not cached, raw examples will be deleted from _CachedDataSource when it is accessed.
    • Iteration starts, the data iterator calls reset() on _CachedDataSource. This will result in the wrapped data source being unnecessarily iterated over again, and could be problematic if the source is an IterDataSource.
    • Dynamic batching strategies require access to the processed examples to determine whether it should be added to the batch. This access erases raw examples from _CachedDataSource.
    • The PyTorch data loader then tries to access the processed examples, but they're already deleted. This results in an exception.

    This case was not covered in our tests, because there are simply too many combinations when it comes to the data module. The current design requires a lot of checks and special handling of corner cases, and is not extensible to new features.

    A possible new design could be:

    1. For lazy loading and caching loaded raw examples, instead of writing the logic in both the iterator and dataset, isolate them into a third class. This class could be another wrapper over the source.
    2. We might be able to do the same for lazy processing and caching processed examples.
    3. Some of the current design was made to workaround certain limitations of the PyTorch data loader. We might as well just rewrite more parts of the data loader, since we already do to preserve compatibility with multiple PyTorch versions.
@huzecong huzecong added bug Something isn't working enhancement New feature or request topic: data Issue about data loader modules labels Oct 7, 2019
@huzecong huzecong self-assigned this Oct 7, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working enhancement New feature or request topic: data Issue about data loader modules
Projects
None yet
Development

No branches or pull requests

1 participant