Multiprocess Fails Accessing Sharrow Cache #876

Closed
dhensle opened this issue Jul 9, 2024 · 8 comments
Labels
Bug (Something isn't working)

Comments

@dhensle
Contributor

dhensle commented Jul 9, 2024

Describe the bug
The model run crashes with a permission error when trying to access the sharrow flow cache in the vehicle allocation model.

I was successfully able to complete the run with 4, 8, 12, and 20 cores, but 16 and 24 cores failed. (Machine has 24 cores total.) I suspect the two runs that failed just got unlucky with the different processes trying to access the same file at the same time. I do not know why this has happened on the vehicle allocation model both times though.

Potential fix: Put in a wait statement if access to the sharrow cache is denied?
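The wait-and-retry idea could be sketched roughly as below. This is only an illustration of the general pattern, not how sharrow or ActivitySim handle cache access; the helper name `retry_on_permission_error` and its parameters are hypothetical.

```python
import time


def retry_on_permission_error(action, attempts=5, initial_delay=0.1):
    """Call `action()` and retry with exponential backoff if access is denied.

    `action` is any zero-argument callable that touches the shared cache.
    This helper is hypothetical and is not part of sharrow or ActivitySim.
    """
    delay = initial_delay
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except PermissionError:
            if attempt == attempts:
                raise  # give up after the final attempt
            time.sleep(delay)  # wait for the other process to finish
            delay *= 2  # back off a little longer each time
```

A retry only hides the symptom: every process would still repeat the same compilation work, so it is slower than avoiding the recompile in the first place.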

To Reproduce
Steps to reproduce the behavior:
Run the MTC example model with sharrow on and multiprocessing enabled with many cores.

Screenshots
image
crash_logs.zip

Additional context
I will try to re-run 16 and 24 and see if I hit the same error again.

dhensle added the Bug label Jul 9, 2024
@i-am-sijia
Contributor

Is it somehow recompiling? The trace shows it's calling sharrow flow rewrite(). It would make sense if multiple processors are rewriting the same file and they clash.

@dhensle
Contributor Author

dhensle commented Jul 10, 2024

I tried multiprocessing with sharrow on for the SANDAG ABM3 model as well, using 4 cores. It also crashed with the same error in vehicle allocation. I agree that the problem seems to be sharrow recompiling after missing one of the vehicle type alternatives.

Interestingly, I do not see sharrow recompiling the vehicle_allocation model when running in single process...

Log files are attached below:
logs.zip

@jpn--
Member

jpn-- commented Jul 11, 2024

It looks like you may be setting the state.filesystem.sharrow_cache_dir attribute to "cache", so that get_sharrow_cache_dir() resolves to outputs\output_sh_mp_full_4cores\cache on your multiprocess run. When you run single process it resolves to outputs\output_sh_full_1thread\cache. So, when you're running multiprocess you're not picking up the pre-compiled sharrow code, but attempting to recompile it. I think you then crash on vehicle_allocation.simple_simulate.eval_mnl, probably because it is the first thing that takes long enough to compile that one process starts the compile step while another process is still compiling.

We can tell it's not a data-type problem because two processes are competing to compile the same "flow", which has a unique hash that ensures the data types are actually the same (if they were different it wouldn't crash as there would be different files with different hashes, it would just be slow).
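To illustrate the reasoning about hashes, here is a toy sketch of a cache key derived from both the expressions and the column data types. It is not sharrow's actual hashing scheme; `flow_signature`, the expression, and the dtype descriptions are all made up for illustration.

```python
import hashlib
import json


def flow_signature(expressions, column_dtypes):
    """Toy stand-in for a flow hash: a digest of expressions plus dtypes.

    Hypothetical helper; sharrow's real hashing is not shown in this thread.
    """
    payload = json.dumps(
        {"expressions": expressions, "dtypes": column_dtypes},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


exprs = ["body_type == 'SUV'"]  # made-up expression
full = {"body_type": "category[Car, SUV, Van]"}  # made-up dtype descriptions
small = {"body_type": "category[Car, SUV]"}

same_a = flow_signature(exprs, full)
same_b = flow_signature(exprs, full)  # identical inputs give an identical key
different = flow_signature(exprs, small)  # different dtypes give another key
```

Two processes that arrive at the same key will target the same cache files, which is the collision case; processes with different keys would write to different files and never clash.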

You can fix this by using the same explicit absolute-path cache location for both model runs, or just by activating the persistent sharrow cache, as I did here in the example exercise script. This moves the sharrow cache out of a run-specific directory and into a common run-agnostic directory, so it gets re-used in cases like this.
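The difference between a relative and an absolute cache setting can be sketched with plain `pathlib`. This assumes a relative setting is resolved against the run's output directory, as the paths above suggest; `resolve_cache_dir` and the `shared_sharrow_cache` location are hypothetical and are not ActivitySim API.

```python
from pathlib import Path


def resolve_cache_dir(cache_setting, output_dir):
    """Resolve a cache setting the way a run-specific lookup might.

    Hypothetical helper: a relative setting lands inside the run's output
    directory, while an absolute setting is used unchanged.
    """
    cache = Path(cache_setting)
    if cache.is_absolute():
        return cache
    return Path(output_dir) / cache


# A relative "cache" setting yields a different directory for each run.
single = resolve_cache_dir("cache", "outputs/output_sh_full_1thread")
multi = resolve_cache_dir("cache", "outputs/output_sh_mp_full_4cores")

# An absolute location (made up here) is shared by both runs.
shared = Path.cwd() / "shared_sharrow_cache"
single_shared = resolve_cache_dir(shared, "outputs/output_sh_full_1thread")
multi_shared = resolve_cache_dir(shared, "outputs/output_sh_mp_full_4cores")
```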

@dhensle
Contributor Author

dhensle commented Jul 12, 2024

We all discussed offline and came to a conclusion. TL;DR We need to convert the vehicle allocation body type and fuel types into integers so as to not trigger sharrow recompilation when running a full sample size.

First, the above comment doesn't really apply in this case as I copy the cache from the compile folder to the output folder I am running in.

Tracing through the cache of the compiled run with 100 households and the multiprocessing run that failed shows that not all alternatives in the vehicle allocation model are captured by the small sample size. The below screenshot shows how the alternatives for body type and fuel type in the vehicle_allocation model are not the same. (The left file is the compile run and is missing alternatives that are present in the full model run on the right.)
image

This issue existed in single processing but was not caught since the single process just recompiled the vehicle allocation model and proceeded on its way. Multi-processing crashed when re-compiling due to the multiple cores trying to write to the same cache file.
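For readers curious how concurrent writers can be kept apart in general, one common approach is a lock file created with exclusive-create semantics, so that only one process compiles at a time. The sketch below is a generic pattern under that assumption, not what sharrow does; `exclusive_lock` and its parameters are hypothetical.

```python
import os
import time
from contextlib import contextmanager


@contextmanager
def exclusive_lock(lock_path, poll_interval=0.05, timeout=60.0):
    """Hold a simple lock file while the body runs.

    Hypothetical helper. O_CREAT | O_EXCL makes creation fail if the file
    already exists, so only one process at a time can acquire the lock.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            if time.monotonic() >= deadline:
                raise TimeoutError(f"could not acquire {lock_path}")
            time.sleep(poll_interval)  # another process holds the lock
    try:
        yield
    finally:
        os.close(fd)
        os.remove(lock_path)  # release so waiting processes can proceed
```

A process would wrap its compile-and-write step in `with exclusive_lock(...)`. Note that a crashed process can leave a stale lock file behind, which this sketch does not handle.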

The fix is to integer encode the body type and fuel type in the vehicle allocation model so that sharrow does not need to know all of the possible categorical options when compiling.
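The idea can be illustrated with a small pure-Python sketch. The body type names and integer codes are made up (the real alternatives are only visible in the screenshot), and this is not the actual preprocessor change: it only shows why categories inferred from a small sample differ from the full run, while a fixed integer mapping looks the same in both.

```python
def inferred_categories(values):
    """Categories as they would be inferred from whatever data is present."""
    return tuple(sorted(set(values)))


# Hypothetical fixed mapping, defined once and independent of any sample.
BODY_TYPE_CODES = {"Car": 1, "SUV": 2, "Van": 3}


def encode(values, codes=BODY_TYPE_CODES):
    """Replace each label with its fixed integer code."""
    return [codes[v] for v in values]


compile_sample = ["Car", "SUV", "Car"]  # small run: "Van" never appears
full_sample = ["Car", "SUV", "Van", "Car"]  # full run sees every alternative

# Inferred categories depend on the sample, so the two runs disagree.
categories_match = (
    inferred_categories(compile_sample) == inferred_categories(full_sample)
)

# Integer codes have the same plain type no matter which values show up.
encoded_types_match = (
    {type(c) for c in encode(compile_sample)}
    == {type(c) for c in encode(full_sample)}
)
```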

@JoeJimFlood
Contributor

I got the same thing for nonmandatory tour scheduling
image

@JoeJimFlood
Contributor

I got it in trip destination as well

@jpn--
Member

jpn-- commented Jul 17, 2024

@JoeJimFlood are you running multiprocessing with a small(ish) household sample size?

The fuel and body type variables discussed higher up this thread were being created as categoricals within a preprocessor, and so did not have a stable dtype. Your issue with tour_type is different because that column is created by core Python code. At first glance I don't see how the categories could get out of order, unless possibly some subprocess is handling a sample part so small that some tour types are just missing.

@jpn--
Member

jpn-- commented Jul 25, 2024

I am closing this, as the original bug in this issue has been addressed by changing the data type of variables in the vehicle allocation pre-processor.

@JoeJimFlood, if you are still encountering problems you can reopen or create a new issue with additional details.

jpn-- closed this as completed Jul 25, 2024