Split "create model data" step and fix inspector data #104

Open
geneorama opened this issue Dec 20, 2017 · 9 comments
@geneorama
Member

In the model project we do not "create model data" as a separate step. Instead we generate food inspection features, business features, then we join those features to the food inspections to create the model data.

This approach is cleaner because all features are treated as separate calculations, rather than having some features that are methodically created and some that are calculated on the fly when the merge happens.

Also, being deliberate about the feature creation is important for the prediction step, because in the predictions the features are not joined to the food inspections; they are joined to the currently active food businesses.
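The distinction can be sketched as a single join step whose base table changes between training and prediction. The project's code is in R; this is an illustrative Python sketch with made-up feature tables and column names:

```python
def join_features(base_rows, feature_tables, key):
    """Left-join each precomputed feature table onto the base rows by `key`.

    In training the base is the food inspections; in prediction the base is
    the currently active food businesses.  Either way, the feature tables
    themselves are computed once, up front, as separate steps.
    """
    out = []
    for row in base_rows:
        merged = dict(row)
        for table in feature_tables:
            merged.update(table.get(row[key], {}))
        out.append(merged)
    return out

# Illustrative data, not the real schema
inspections = [{"License": 1, "Inspection_ID": 10}]
business_features = {1: {"ageAtInspection": 1}}

model_data = join_features(inspections, [business_features], key="License")
```

Swapping `inspections` for a table of active businesses reuses the same feature tables unchanged, which is the point of separating feature creation from the merge.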

Separating the feature creation from the merge is a little complicated but doable. There is, however, a problem with the sanitarians. Several steps need to be taken so that the inspector identities can be treated as an independent feature, and the steps needed to process the historical data in the repository are different from the steps that we use in production (in production the data is cleaner; we didn't have this data source available when we did the evaluation).

So, in the evaluation there needs to be a helper script / step to get the inspector data into a format that is simply Inspection_ID and Inspector_Assigned. This involves:

  1. Getting the inspector data ready so that it can be matched to food inspections using the License Number and Inspection Date. This match will allow us to get the Inspection ID.
  2. Cleaning up the License Number of the inspectors to improve the match quality
  3. Removing invalid records
  4. Deduplicating inspections / inspection records
  5. Matching / merging the inspection data onto the inspector data, and subsetting to the two columns we want

I have a working version of the code, but the xmat differs from the original by 27 rows: 18,739 rows instead of 18,712. I remember encountering this when I did the refactoring the first time, but I don't remember what caused the discrepancy. It has to do with the order of steps in the original dat_model creation versus the new process of creating two separate files.
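One way to track down a discrepancy like this is to diff the two versions by key rather than by row count (a sketch, assuming both versions of xmat carry Inspection_ID):

```python
def diff_by_key(old_ids, new_ids):
    """Return the keys gained and lost between two versions of a table."""
    old, new = set(old_ids), set(new_ids)
    return sorted(new - old), sorted(old - new)

# Toy IDs; in practice these would be the two xmat Inspection_ID columns
gained, lost = diff_by_key([1, 2, 3], [2, 3, 4, 5])
```

Inspecting the gained/lost Inspection_IDs directly usually points at which step (a filter, a dedup, or a join) diverged between the two pipelines.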

After this step is complete, we will be much closer to replicating the process in the model project, except for the prediction step.

@geneorama
Member Author

I encountered a slight wrinkle... I also need to change the model script in some way so that we can read in the canned weather data.

In the model we read in the weather from DATA/17_mongo_weather_update.Rds, which is produced by the script CODE/17_mongo_weather_update.R. The download script gets the weather data from Mongo, of course, which is an internal system. (It's called "update" because it's the …)

In the evaluation we read the weather from DATA/weather_20110401_20141031.Rds.

Two solutions:

  1. Simply rename weather_20110401_20141031.Rds to 17_mongo_weather_update.Rds
  2. Update the mongo weather with some more recent weather data that reflects what we actually use in the model.

The weather data that we use in the model is similar to the data stored in the project, but it's not the same. I vaguely remember reconciling this at some point; the plots look familiar.

Here pi is the new precipitation intensity:

[image: comparison plot of precipitation intensity]

and tm is the new temperature max:

[image: comparison plot of max temperature]

I'm leaning toward the first solution, i.e. renaming the file so that the data pipeline just works.
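Solution 1 can be as small as a copy step, so the original file is preserved under its dated name (a sketch; the paths come from the comment above, and the helper name is made up):

```python
from pathlib import Path
import shutil

def alias_weather_file(data_dir="DATA",
                       src="weather_20110401_20141031.Rds",
                       dst="17_mongo_weather_update.Rds"):
    """Copy the canned evaluation weather file to the name the model
    pipeline expects.  Returns True if the alias was created."""
    src_path, dst_path = Path(data_dir, src), Path(data_dir, dst)
    if not src_path.exists():
        return False
    # Copy rather than rename so the dated original stays in place
    shutil.copy2(src_path, dst_path)
    return True
```

Copying (rather than renaming in place) keeps both file names valid, so the evaluation and model scripts can each find the file they expect.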

@geneorama
Member Author

One more problem before I can make sure that the new model code works: the tobacco category changed from tobacco_retail_over_counter to tobacco, so again the canned data doesn't work in the model.

In this case I think it would be best to refresh the business license information.

Other solutions, like conditionally renaming the field / column header are also possible, but I think they're more likely to lead to confusion and cause bugs.

The business license data has changed quite a bit since we stored it back in the first evaluation: there are more columns and (of course) many more records, so it's a much larger data set than before, but the content should be the same.

@tomschenkjr thoughts on this?

@tomschenkjr
Contributor

tomschenkjr commented Dec 22, 2017 via email

@geneorama
Member Author

I just re-downloaded the business data.

Previously we had 470,994 records, now we have 923,834 records.

However, the number of business records that go into the model as features is the same after the download: 27,600. So it looks like the business data is consistent after doing the filtering, subsetting, and matching.

@geneorama
Member Author

Note that the records on the data portal represent Licenses... but the 27,600 figure represents the licenses reshaped and summarized by business.
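That reshape can be sketched as a group-by over the license records. This is illustrative Python (the real code is R, and the grouping key and summary columns are placeholders for whatever the project actually uses):

```python
from collections import defaultdict

def summarize_by_business(license_rows, key="Business_ID"):
    """Collapse one-row-per-license data down to one row per business."""
    grouped = defaultdict(list)
    for row in license_rows:
        grouped[row[key]].append(row)
    # One summary row per business: license count and earliest start
    return [{"Business_ID": b,
             "n_licenses": len(rows),
             "min_start": min(r["Start"] for r in rows)}
            for b, rows in grouped.items()]

rows = [{"Business_ID": "a", "Start": 2012},
        {"Business_ID": "a", "Start": 2014},
        {"Business_ID": "b", "Start": 2013}]
summary = summarize_by_business(rows)
```

This is why the portal's record count can roughly double (470,994 to 923,834 licenses) while the summarized business-level count stays at 27,600.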

@geneorama
Member Author

So after I make all these changes and refresh the business license data, the final xmat looks pretty similar to the original. The first few rows are identical, and the number of rows in xmat goes from 18,712 to 18,781.

old xmat structure:

# > str(xmat)
# Classes ‘data.table’ and 'data.frame':	18712 obs. of  13 variables:
#  $ Inspection_ID                              : num  269961 507211 507212 507216 507219 ...
#  $ Inspector                                  : chr  "green" "blue" "blue" "blue" ...
#  $ pastSerious                                : num  0 0 0 0 0 0 0 0 0 0 ...
#  $ pastCritical                               : num  0 0 0 0 0 0 0 0 0 0 ...
#  $ timeSinceLast                              : num  2 2 2 2 2 2 2 2 2 2 ...
#  $ ageAtInspection                            : int  1 1 1 1 1 1 0 1 1 0 ...
#  $ consumption_on_premises_incidental_activity: num  0 0 0 0 0 0 0 0 0 0 ...
#  $ tobacco_retail_over_counter                : num  1 0 0 0 0 0 0 0 0 0 ...
#  $ temperatureMax                             : num  53.5 59 59 56.2 52.7 ...
#  $ heat_burglary                              : num  26.99 13.98 12.61 35.91 9.53 ...
#  $ heat_sanitation                            : num  37.75 15.41 8.32 38.19 2.13 ...
#  $ heat_garbage                               : num  12.8 12.9 8 26.2 3.4 ...
#  $ criticalFound                              : num  0 0 0 0 0 0 0 0 0 0 ...
#  - attr(*, "sorted")= chr "Inspection_ID"
#  - attr(*, ".internal.selfref")=<externalptr> 

new xmat structure:

> str(xmat)
Classes ‘data.table’ and 'data.frame':	18781 obs. of  13 variables:
 $ Inspection_ID                              : num  269961 507211 507212 507216 507219 ...
 $ criticalFound                              : num  0 0 0 0 0 0 0 0 0 0 ...
 $ pastSerious                                : num  0 0 0 0 0 0 0 0 0 0 ...
 $ ageAtInspection                            : int  1 1 1 1 1 1 0 1 1 0 ...
 $ pastCritical                               : num  0 0 0 0 0 0 0 0 0 0 ...
 $ consumption_on_premises_incidental_activity: num  0 0 0 0 0 0 0 0 0 0 ...
 $ tobacco                                    : num  1 0 0 0 0 0 0 0 0 0 ...
 $ temperatureMax                             : num  53.5 59 59 56.2 52.7 ...
 $ heat_burglary                              : num  26.99 13.98 12.61 35.91 9.53 ...
 $ heat_sanitation                            : num  37.75 15.41 8.32 38.19 2.13 ...
 $ heat_garbage                               : num  12.8 12.9 8 26.2 3.4 ...
 $ Inspector                                  : chr  "green" "blue" "blue" "blue" ...
 $ timeSinceLast                              : num  2 2 2 2 2 2 2 2 2 2 ...
 - attr(*, "sorted")= chr "Inspection_ID"
 - attr(*, ".internal.selfref")=<externalptr> 

Obviously looking at the structure isn't very detailed, but I'll check out the model in a minute.
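A quick way to make the comparison concrete is to diff the two column sets directly, ignoring order (the names below are copied from the str() output above; the set diff itself is just a sketch):

```python
old_cols = {"Inspection_ID", "Inspector", "pastSerious", "pastCritical",
            "timeSinceLast", "ageAtInspection",
            "consumption_on_premises_incidental_activity",
            "tobacco_retail_over_counter", "temperatureMax",
            "heat_burglary", "heat_sanitation", "heat_garbage",
            "criticalFound"}
new_cols = {"Inspection_ID", "criticalFound", "pastSerious",
            "ageAtInspection", "pastCritical",
            "consumption_on_premises_incidental_activity", "tobacco",
            "temperatureMax", "heat_burglary", "heat_sanitation",
            "heat_garbage", "Inspector", "timeSinceLast"}

only_old = old_cols - new_cols   # columns dropped by the refactor
only_new = new_cols - old_cols   # columns added by the refactor
```

The only difference is the tobacco rename; the remaining 12 columns match, just in a different order.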

@geneorama
Member Author

Actually it's going to take a minute to do an apples to apples comparison. The production model doesn't split the test / train data, because in production we build the model on everything.

@geneorama
Member Author

It looks like the model is producing very similar results, but I'm not ready to push up the draft of the evaluation.

The gini on the test data is 34.6%, the previous number was 34.5%... so the results look comparable.
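A common definition of the gini metric for a binary classifier is gini = 2·AUC − 1; a minimal sketch under that assumption (the evaluation code may compute it differently):

```python
def auc(labels, scores):
    """AUC via the rank-sum (Mann-Whitney) formula: the probability that
    a random positive outranks a random negative, ties counting half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def gini(labels, scores):
    """Gini coefficient: 0 for random ranking, 1 for perfect ranking."""
    return 2 * auc(labels, scores) - 1
```

Under this definition, a gini of 34.6% corresponds to an AUC of about 0.673.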

@geneorama
Member Author

I mostly finished up the split on Wednesday, then I noticed a few things that needed attention yesterday and I fixed those and pushed up a big commit. I'm testing it now with the downstream prediction step.

It was tricky to find a clean way to run the production and evaluation models with the same code. I solved this by running the model twice, once with all the data and once with all the data except the past 90 days. This works because the evaluation data has that big gap between the test / train periods, so we don't need to be explicit about the exact start and end of the experiment; it's implicitly defined by the availability of the data.
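The run-it-twice idea can be sketched as one windowing function with the holdout as a parameter (illustrative Python; field names are made up):

```python
from datetime import date, timedelta

def training_window(rows, today, holdout_days=90):
    """Drop the most recent `holdout_days` of data (the evaluation run);
    holdout_days=0 gives the production run on everything."""
    cutoff = today - timedelta(days=holdout_days)
    return [r for r in rows if r["Inspection_Date"] < cutoff]

rows = [{"Inspection_Date": date(2017, 1, 1)},
        {"Inspection_Date": date(2017, 12, 1)}]
evaluation = training_window(rows, today=date(2017, 12, 20))
production = training_window(rows, today=date(2017, 12, 20), holdout_days=0)
```

Both runs go through identical code; only the cutoff differs, which is what keeps the two pipelines from diverging again.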

I also had to reorganize the workflow a bit to accommodate the test / train index management. Previously we could just take out the NA values right before we fit the model, but now we need to be more careful or else the matrix would have different rows than the source data frame, which is where the test / train index is stored. Obviously there are lots of ways to solve this sort of thing; I tried to choose something that kept the code easy to follow and audit.
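The alignment issue amounts to remembering which row positions survived the NA filter, so test / train membership can still be looked up against the source frame. A minimal sketch (not the project's actual approach):

```python
def drop_na_rows(rows):
    """Drop rows containing None, but also return the surviving original
    row positions so an index stored on the source frame stays usable."""
    kept, positions = [], []
    for i, row in enumerate(rows):
        if all(v is not None for v in row.values()):
            kept.append(row)
            positions.append(i)
    return kept, positions

rows = [{"x": 1}, {"x": None}, {"x": 3}]
matrix, keep_idx = drop_na_rows(rows)
```

With `keep_idx` in hand, a boolean test / train flag on the original frame can be subset to match the filtered matrix row for row.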

geneorama added a commit that referenced this issue Apr 10, 2019
v1.7.0

 - Data file names now mirror the script names that created the files
 - Features on food inspections are now calculated separately
 - Features on business inspections are now calculated separately
 - The model code merges in the features, does not calculate features
 - Added script to adjust the public sanitarian data to match the schema of the private sanitarian file
 - More aggressive filtering functions
 - Separates out the violation matrix calculation into a parsing step and a classification step (which, as it turns out, will be useful for the new inspection format)
 - Refactoring model result / evaluation steps to accommodate future analysis


* adding prefix number to code and data, closes #100
* syncing and updating startup script, closes #101
* split violation matrix calculation into two steps, closes #102
* updated help example to remove unused variable
* adding nokey function, needed for new violation matrix calculation
* guard against too few categories in GenerateOtherLicenseInfo, closes #103
* updating filter functions to match model
* starting work described in #104 to split feature creation
* refactoring code for model compatibility
* simplifying initialization