Very Slow Chunking #861
Labels: Bug (Something isn't working)

Comments
As mentioned above, I tested with the current main branch of the code and the sandag-abm3-example; the results were very similar. I ran 100k households in chunk_training mode, without sharrow, on 10 cores. The chunk training run took about 24 hours! Log files are attached.
Describe the bug
Chunk training takes a VERY long time.
The run was performed on a SANDAG server with 1 TB of RAM (`chunk_size` was set to 450GB), using only 64k households (~5% sample) and 5 cores. Run time was 66.85 hours, or 2.78 days!
To Reproduce
Run the SANDAG ABM3 model in chunk training mode. This was performed with the BayDAG_estimation branch which is based off ActivitySim version 1.2.
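For reference, chunk training mode is enabled through the top-level settings file. A minimal sketch of the relevant `settings.yaml` entries, assuming the standard ActivitySim setting names (the specific values shown are illustrative, matching the run described above):

```yaml
# settings.yaml (fragment) -- hypothetical values matching this run
chunk_training_mode: training   # collect chunk sizing data for later production runs
chunk_size: 450_000_000_000     # ~450 GB, in bytes
num_processes: 5                # multiprocess worker count
households_sample_size: 64000   # ~5% sample
```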
Expected behavior
Chunk training shouldn't take much longer than actually running the model, and we have not seen chunk training take this long before. Is there something about the SANDAG model that makes training slow (e.g., its two-zone system)? Or was a dependency updated in a way that severely hurt performance?
Additional context
Log files can be seen here: training_log.zip
Running in production mode also took an extremely long time (again, > 2.5 days!). Part of the problem may be that `num_processors` was set to 40 while the machine only has 32 cores, but that alone shouldn't matter this much.
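As a sanity check for the oversubscription point above, a minimal sketch of capping the configured worker count at the machine's actual core count (the function name `clamp_num_processes` is hypothetical, not an ActivitySim API):

```python
import multiprocessing


def clamp_num_processes(requested: int) -> int:
    """Cap a requested worker count at the machine's available cores.

    Oversubscribing (e.g. num_processors = 40 on a 32-core box) makes the
    OS time-slice workers, which can slow a run somewhat, though it should
    not by itself explain a multi-day runtime.
    """
    available = multiprocessing.cpu_count()
    return min(requested, available)
```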
Looking at the production logs shows that about 700 minutes(!) of run time was spent in the parking location choice model. This appears to be because ActivitySim creates a chunk for every single chooser in that model (hence log statements like `Running chunk 10450 of 10456 with 1 of 10456 choosers`). The `chunk_cache.csv` (found in the training_log above) certainly shows that more than one row should be allowed per chunk when `chunk_size` is set to 450GB.
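To quantify the one-chooser-per-chunk symptom, a small sketch that scans log lines for the chunk statements quoted above and counts degenerate chunks (the regex is keyed to the exact message format shown in the logs; `degenerate_chunks` is a hypothetical helper, not part of ActivitySim):

```python
import re

# Matches log lines like:
#   "Running chunk 10450 of 10456 with 1 of 10456 choosers"
CHUNK_RE = re.compile(r"Running chunk (\d+) of (\d+) with (\d+) of (\d+) choosers")


def degenerate_chunks(log_lines):
    """Return (one_chooser_chunks, total_chunks) seen in the log lines.

    A count close to the total indicates the chunker is splitting work down
    to a single chooser per chunk, as observed in the parking location
    choice model here.
    """
    total = degenerate = 0
    for line in log_lines:
        m = CHUNK_RE.search(line)
        if m:
            total += 1
            if int(m.group(3)) == 1:
                degenerate += 1
    return degenerate, total
```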
production_log_subset.zip
Is this behavior related to #860?
(Currently working on reproducing with the main branch, but run is not yet complete. I will update once complete...)