Data Pipelining and Random Number Sequencing Design
Ben Stabler edited this page May 12, 2017 · 18 revisions
- Restartable model runs. For example, we have a model with three sub-models A, B, and C. Yesterday we ran sub-models A, B, and C and today we want to just run sub-model C with different settings. We need to keep track of the state of the system after sub-models A and B are run since they are presumably inputs to sub-model C.
- We need to run a sub-model for a household, person, tour, trip, etc. and get the same random number draw regardless of which computer is used, whether other households are being simulated at the same time, or whether inputs have changed (i.e. it is a different scenario). Examples are household vehicle ownership, person work location, and tour mode choice.
- We need as-stable-as-possible random number sequencing across scenarios and sample rates for households, persons, tours, trips, sub-models, etc.
- We need stable random numbers with restartable data pipelining
- Create our own framework so it can work with orca
- We are reducing our dependency on orca, but not abandoning it since that would be too expensive
- Orca tables are being saved to the datastore as pandas data frames and then being wrapped as orca tables on I/O
- Each household, person, tour, trip, and sub-model has a random number stream and an offset. For example, when the model runs sub-model A it uses the first offset, sub-model B uses the second offset, and sub-model C uses the third offset. If we restart the model run at sub-model C, it sees in the datastore that sub-models A and B were already run and that offsets 1 and 2 have therefore already been used.
- The offsets are by sub-model run order, not sub-model name; this is more flexible and avoids requiring an a priori dictionary
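The run-order offset bookkeeping described above can be sketched as follows. This is only an illustration, not the ActivitySim implementation: `plan_run` and its arguments are hypothetical names, the list of completed steps stands in for whatever the datastore records, and offsets are numbered from 1 to match the wording above.

```python
def plan_run(models, completed):
    """Split a run list into offsets already used and offsets still to run.

    models:    sub-model names in run order (hypothetical input)
    completed: names recorded as already run, in order (stand-in for
               what would be read back from the datastore)
    """
    # Offsets follow run order, so completed steps must be a prefix of the list.
    if completed != models[:len(completed)]:
        raise ValueError("completed steps must be a prefix of the run list")

    # Offsets are numbered from 1 here, purely to match the prose.
    used = [(i + 1, name) for i, name in enumerate(completed)]
    pending = [(i + 1, name) for i, name in enumerate(models)
               if i >= len(completed)]
    return used, pending


# Yesterday A, B and C were run; today only C is rerun.
used, pending = plan_run(["A", "B", "C"], completed=["A", "B"])
print(used)     # offsets consumed by earlier steps
print(pending)  # offsets the restarted run will use
```

Because the offset comes from a step's position in the list rather than its name, no dictionary of model names has to be maintained in advance.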
The revised model run setup looks like this:
_MODELS = [
    'compute_accessibility',
    'school_location_simulate',
    'workplace_location_simulate',
    'auto_ownership_simulate',
    'cdap_simulate',
    'mandatory_tour_frequency',
    'mandatory_scheduling',
    'non_mandatory_tour_frequency',
    'destination_choice',
    'non_mandatory_scheduling',
    'tour_mode_choice_simulate',
    # 'trip_mode_choice_simulate'
]

# resume_after = 'mandatory_scheduling'
resume_after = None

pipeline.get_rn_generator().set_base_seed(0)  # global seed
pipeline.run(models=_MODELS, resume_after=resume_after)
Here are the contents of the data pipeline HDF5 file, which stores the state of each pandas DataFrame after every sub-model that revises it. You can see that the number of columns changes as the sub-models are run.
<class 'pandas.io.pytables.HDFStore'>
File path: pipeline.h5
/accessibility/compute_accessibility (shape->[25,21])
/checkpoints (shape->[12,11])
/households/compute_accessibility (shape->[1000,64])
/households/auto_ownership_simulate (shape->[1000,67])
/households/cdap_simulate (shape->[1000,68])
/land_use/compute_accessibility (shape->[25,49])
/mandatory_tours/mandatory_tour_frequency (shape->[766,4])
/mandatory_tours/mandatory_scheduling (shape->[766,5])
/non_mandatory_tours/non_mandatory_tour_frequency (shape->[1256,4])
/non_mandatory_tours/destination_choice (shape->[1256,5])
/non_mandatory_tours/non_mandatory_scheduling (shape->[1256,6])
/persons/compute_accessibility (shape->[1549,50])
/persons/school_location_simulate (shape->[1549,54])
/persons/workplace_location_simulate (shape->[1549,59])
/persons/cdap_simulate (shape->[1549,64])
/persons/mandatory_tour_frequency (shape->[1549,69])
/persons/non_mandatory_tour_frequency (shape->[1549,72])
/tours/tour_mode_choice_simulate (shape->[2022,37])
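To illustrate how a restart might pick up the right table versions from a listing like the one above, here is a minimal sketch. It assumes keys have the form `/table/sub_model` and infers versions from the key names alone; the real pipeline may rely on the `/checkpoints` table instead, and `latest_tables` is a hypothetical helper, not an ActivitySim function.

```python
def latest_tables(keys, run_order, resume_after):
    """Map each table name to the key of its newest version
    written at or before the `resume_after` step."""
    cutoff = run_order.index(resume_after)
    latest = {}
    for key in keys:
        parts = key.strip("/").split("/")
        if len(parts) != 2:
            continue  # skip keys such as /checkpoints
        table, step = parts
        if step not in run_order:
            continue
        position = run_order.index(step)
        if position > cutoff:
            continue  # written by a step that will be rerun
        if table not in latest or position > latest[table][0]:
            latest[table] = (position, key)
    return {table: key for table, (position, key) in latest.items()}


run_order = [
    'compute_accessibility', 'school_location_simulate',
    'workplace_location_simulate', 'auto_ownership_simulate',
    'cdap_simulate', 'mandatory_tour_frequency', 'mandatory_scheduling',
    'non_mandatory_tour_frequency',
]
keys = [
    '/checkpoints',
    '/households/compute_accessibility',
    '/households/auto_ownership_simulate',
    '/households/cdap_simulate',
    '/mandatory_tours/mandatory_tour_frequency',
    '/mandatory_tours/mandatory_scheduling',
    '/non_mandatory_tours/non_mandatory_tour_frequency',
    '/persons/cdap_simulate',
    '/persons/mandatory_tour_frequency',
    '/persons/non_mandatory_tour_frequency',
]

tables = latest_tables(keys, run_order, resume_after='mandatory_scheduling')
for name, key in sorted(tables.items()):
    print(name, '->', key)
```

With `resume_after='mandatory_scheduling'`, tables first written by later steps (here `non_mandatory_tours`) are left out, and `persons` resolves to the version saved after `mandatory_tour_frequency`.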
- Random number generation is done using numpy's Mersenne Twister PRNG
- ActivitySim uses a stream of random numbers for each household id, person id, tour id, (soon trip id), and model step offset
- The seed (offset/starting point) is based on the global seed, household id, person id, tour id, (soon trip id), and model step offset. The equation looks something like this:
  chooser.index * number of models for chooser + chooser model offset + global seed offset
For example:
  household.id * 2 + 0 + 1
where:
  household.id = household table index
  2 = number of household level models - auto ownership and cdap
  0 = first household model - auto ownership
  1 = global seed offset for testing the same model under different random global seeds
- Tour id is segmented by tour type
- The sequencing is thread/process safe for eventual multiprocessor support
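The seeding scheme above can be sketched with numpy's `RandomState`, which wraps the Mersenne Twister. This is a rough illustration of the formula as written, under the assumption that each chooser gets a freshly seeded generator per model step; `chooser_seed` and `draw_for_choosers` are hypothetical names, not ActivitySim functions.

```python
import numpy as np


def chooser_seed(chooser_id, n_models, model_offset, global_seed=0):
    """Seed for one chooser at one model step, following the formula above."""
    return chooser_id * n_models + model_offset + global_seed


def draw_for_choosers(chooser_ids, n_models, model_offset, global_seed=0):
    """Return one uniform draw per chooser, each from its own seeded stream."""
    draws = {}
    for chooser_id in chooser_ids:
        seed = chooser_seed(chooser_id, n_models, model_offset, global_seed)
        prng = np.random.RandomState(seed)  # Mersenne Twister (MT19937)
        draws[chooser_id] = prng.random_sample()
    return draws


N_HOUSEHOLD_MODELS = 2     # auto ownership and cdap
AUTO_OWNERSHIP_OFFSET = 0  # first household model

# Same households, simulated in a full sample and in a smaller sample.
full_sample = draw_for_choosers([10, 11, 12, 13], N_HOUSEHOLD_MODELS,
                                AUTO_OWNERSHIP_OFFSET, global_seed=1)
half_sample = draw_for_choosers([11, 13], N_HOUSEHOLD_MODELS,
                                AUTO_OWNERSHIP_OFFSET, global_seed=1)

# Household 11 gets the same draw whichever other households are present.
print(full_sample[11] == half_sample[11])
```

Since a draw depends only on the chooser's id, the model step, and the global seed, it does not depend on which other choosers are simulated or in what order they are processed, which is the property the restart and multiprocessing requirements rely on.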