Make IterToMap loading more lazy #1016

Open · SvenDS9 wants to merge 6 commits into main

Conversation

@SvenDS9 (Contributor) commented on Feb 15, 2023

Fixes #454

Changes

  • Only load from the source datapipe until the requested element has been loaded (see the sketch below)
  • Add a test for this behavior
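The core idea, as a minimal sketch (illustrative names, not the actual IterToMapConverter code): keep a partially filled dict plus a live iterator over the source, and pull items only until the requested key appears.

```python
class LazyIterToMap:
    def __init__(self, source):
        self._source = source   # iterable of (key, value) pairs
        self._map = {}          # partially materialized map
        self._itr = None        # live iterator, kept alive between lookups

    def __getitem__(self, key):
        if key in self._map:
            return self._map[key]
        if self._itr is None:
            self._itr = iter(self._source)
        # Pull from the source only until the requested key shows up.
        for k, v in self._itr:
            self._map[k] = v
            if k == key:
                return v
        self._itr = None        # source exhausted without finding the key
        raise KeyError(key)

dp = LazyIterToMap([("a", 1), ("b", 2), ("c", 3)])
print(dp["b"])  # loads "a" and "b" only; "c" stays unread
```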

@facebook-github-bot added the CLA Signed label on Feb 15, 2023
@SvenDS9 (Contributor, Author) commented on Feb 24, 2023

Some tests in torchvision fail. Apparently a stream isn't being closed, and my implementation of IterToMap's __getitem__() causes this.

I think the stream in question is created here:
https://github.com/pytorch/vision/blob/01ef0a68b6ec00452391251fc16c38e58b92bf07/test/builtin_dataset_mocks.py#L1356

This causes this test to fail:
https://github.com/pytorch/vision/blob/01ef0a68b6ec00452391251fc16c38e58b92bf07/test/test_prototype_datasets_builtin.py#L122-L140
While there is a small bug in this test (closing the stream inside the loop changes the dictionary's size during iteration, which causes subsequent tests to fail as well because the dict isn't fully emptied for the next test; a sketch of a fix is below), fixing it wouldn't solve the underlying issue.
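For reference, a sketch of a fix for that in-test bug (`open_streams` is a hypothetical name for the dict of tracked handles): iterate over a snapshot so closing and removing streams doesn't mutate the dict mid-iteration.

```python
# Snapshot the items first; deleting from the dict while iterating
# over it directly would raise RuntimeError.
for name, stream in list(open_streams.items()):
    stream.close()
    del open_streams[name]
```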

https://github.com/pytorch/vision/blob/01ef0a68b6ec00452391251fc16c38e58b92bf07/torchvision/prototype/datasets/_builtin/cub200.py#L194-L208

As I don't know how to fix it I would be happy if someone else could chime in.

@pmeier (Contributor) commented on Mar 9, 2023

@ejguan I guess you have the most context on the errors, given you fixed pytorch/vision#6997. We just merged pytorch/vision#7403 to make diagnosing this easier. The patch will be in tomorrow's nightly. Could you also have a look here?

@ejguan (Contributor) commented on Mar 9, 2023

> @ejguan I guess you have the most context on the errors, given you fixed pytorch/vision#6997. We just merged pytorch/vision#7403 to make diagnosing this easier. The patch will be in tomorrow's nightly. Could you also have a look here?

@pmeier Thanks, I think pytorch/vision#7403 does the right job. I will take a look at this PR to see why such an issue happens with it.

@ejguan (Contributor) left a review comment


I don't know the exact reason why this PR fails TorchVision's tests. I have a strong feeling it's because the iterator object is never properly cleaned up.

Review threads on torchdata/datapipes/iter/util/converter.py (resolved; one outdated)
@SvenDS9 (Contributor, Author) commented on Mar 9, 2023

I think so as well.

@pmeier Thank you for linking pytorch/vision#6997; I think I know what's happening now!
Forcefully depleting the dp (e.g. by adding a Prefetcher with a large enough buffer in cub200) solves the issues with the test, but it defeats the purpose of this PR, which is saving memory.

Probably not every value is retrieved from the map in the test, therefore the map is never fully loaded and the iterator of the previous dp never "finishes".

How would we address this if we want lazy loading?

@ejguan (Contributor) commented on Mar 9, 2023

> Forcefully depleting the dp (e.g. by adding a Prefetcher with a large enough buffer in cub200) solves the issues with the test, but it defeats the purpose of this PR, which is saving memory.

I think the problem is that the iterator which opens the file handles never reaches its finally clause to close them (e.g. this code pointer). Depleting works because it forces the iterator to exit and execute the finally block. And that's also why I think removing self._itr would help.
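For illustration, a minimal example of the mechanics described above (the file name is a placeholder): a generator that opens a resource only releases it via `finally` once it is exhausted or explicitly closed.

```python
def read_lines(path):
    f = open(path)    # resource acquired inside the generator
    try:
        yield from f
    finally:
        f.close()     # runs only on exhaustion or explicit close()

itr = read_lines("some_file.txt")
next(itr)    # one item pulled; the file handle is still open
# If `itr` is held alive (e.g. as self._itr) and never exhausted,
# `finally` never runs and the handle leaks.
itr.close()  # raises GeneratorExit inside the generator, running `finally`
```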

@pmeier (Contributor) commented on Mar 10, 2023

@SvenDS9 seems like the test now displays a proper error message: https://github.com/pytorch/data/actions/runs/4384715686/jobs/7676523545#step:9:2098. So the failure is still ongoing, but at least it is now clear what is happening.

@pmeier (Contributor) commented on Mar 10, 2023

Just as a heads-up: feel free to change anything under torchvision.prototype.datasets if it doesn't work well with torchdata. This is very much a work in progress and nothing is set in stone. So if there is a better way to solve the issue that unblocks this PR, send a PR to torchvision and ping me there.

@SvenDS9 (Contributor, Author) commented on Mar 13, 2023

> Probably not every value is retrieved from the map in the test, therefore the map is never fully loaded and the iterator of the previous dp never "finishes".

I am pretty sure that is what happens. That also explains why removing the reference to the iterator once it is depleted doesn't help: that code is never reached. As the iterator hasn't finished yet, the stream needs to remain open so that more elements can be retrieved from the dp. So in a way, this is expected behavior.

I think we should make load_map() public so that users can choose to load the entire map in one go. Currently (in the nightly build) this happens the first time an element is requested from the map.
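A hypothetical usage sketch, assuming the loading helper were exposed as `load_map()` (the name comes from this thread, not from an existing torchdata API):

```python
from torchdata.datapipes.iter import IterableWrapper

source_dp = IterableWrapper([("a", 1), ("b", 2)])
map_dp = source_dp.to_map_datapipe()
map_dp.load_map()    # hypothetical: eagerly drain the source so its
                     # iterator reaches `finally` and closes its streams
print(map_dp["a"])
```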

To fix the tests in torchvision we could either:

WDYT @ejguan @pmeier ?

@pmeier (Contributor) commented on Mar 13, 2023

Of the two options provided, I strongly favor 2., since 1. would only fix the test, not the behavior. I'll let @ejguan comment on whether the actual proposal is the way to go here.

That being said, we are currently not actively working on the datasets and thus I'm also ok with 1. to unblock. However, this means if we pick this up again in the future, we need a proper solution then.

@ejguan (Contributor) commented on Mar 13, 2023

So I guess we need to figure out a way to let users indicate when they are done with the MapDataPipe, and then delete/deplete the iterator of the prior DataPipe (it would be better if we can make this automatic).
Another note: adding __del__ to IterToMap won't help because it's only invoked when the garbage collector runs. So, at the time of TorchVision's test, there is a chance that self._itr has not been removed yet and the file handles are still open.

Here is my third proposal, which requires changes to PyTorch, TorchData, and TorchVision. In PyTorch core, add a base callback function to every MapDataPipe that runs at the end of an epoch. For cub200, instead of using Mapper, we can rely on MapKeyZipper to combine the pipes and use merge_fn to drop the value from split_dp. In TorchData, MapKeyZipper knows when the IterDataPipe ends and can then invoke the callback on the MapDataPipe. A rough sketch is below.
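A self-contained sketch of that flow (hypothetical names, not existing torchdata APIs):

```python
class SimpleMap:
    # Stand-in for a MapDataPipe with the proposed end-of-epoch callback.
    def __init__(self, data):
        self._data = dict(data)

    def __getitem__(self, key):
        return self._data[key]

    def on_epoch_end(self):
        # Base callback: IterToMap would close its live iterator
        # (and thereby any open file handles) here.
        print("epoch ended, releasing resources")

def map_key_zipper(source, map_dp, key_fn, merge_fn):
    # Zip an IterDataPipe-like iterable with a map; fire the callback
    # once the iterable side is exhausted.
    try:
        for item in source:
            yield merge_fn(item, map_dp[key_fn(item)])
    finally:
        map_dp.on_epoch_end()

labels = SimpleMap({1: "cat", 2: "dog"})
for pair in map_key_zipper([1, 2], labels, lambda x: x, lambda x, v: (x, v)):
    print(pair)
```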

Any suggestions are welcome!
