Updating code base to reflect changes to private model repo (#107)
v1.7.0

 - Data file names now mirror the script names that created the files
 - Features on food inspections are now calculated separately
 - Features on business inspections are now calculated separately
 - The model code now merges in the precomputed features; it no longer calculates them
 - Added script to adjust the public sanitarian data to match the schema of the private sanitarian file
 - More aggressive filtering functions
 - Separated the violation matrix calculation into a parsing step and a classification step (which, as it turns out, will be useful for the new inspection format)
 - Refactoring model result / evaluation steps to accommodate future analysis


* adding prefix number to code and data, closes #100
* syncing and updating startup script, closes #101
* split violation matrix calculation into two steps, closes #102
* updated help example to remove unused variable
* adding nokey function, needed for new violation matrix calculation
* guard against too few categories in GenerateOtherLicenseInfo, closes #103
* updating filter functions to match model
* starting work described in #104 to split feature creation
* refactoring code for model compatibility
* simplifying initialization
geneorama committed Apr 10, 2019
1 parent d6188d2 commit c522b7b
Showing 47 changed files with 711 additions and 670 deletions.
66 changes: 53 additions & 13 deletions CODE/00_Startup.R
@@ -1,15 +1,55 @@
## INSTALL THESE DEPENDENCIES
install.packages("devtools",
dependencies = TRUE,
repos='http://cran.us.r-project.org')
install.packages("Rcpp",
dependencies = TRUE,
repos='http://cran.us.r-project.org')

## Update two packages not on CRAN using the devtools package.
devtools::install_github(repo = 'geneorama/geneorama')
devtools::install_github(repo = 'yihui/printr')
##------------------------------------------------------------------------------
## INSTALL DEPENDENCIES IF MISSING
##------------------------------------------------------------------------------

if(!"devtools" %in% rownames(installed.packages())){
install.packages("devtools",
dependencies = TRUE,
repos = "https://cloud.r-project.org/")
}

if(!"Rcpp" %in% rownames(installed.packages())){
install.packages("Rcpp",
dependencies = TRUE,
repos = "https://cloud.r-project.org/")
}

if(!"RSocrata" %in% rownames(installed.packages())){
install.packages("RSocrata",
dependencies = TRUE,
repos = "https://cloud.r-project.org/")
}

if(!"data.table" %in% rownames(installed.packages())){
install.packages("data.table",
dependencies = TRUE,
repos = "https://cloud.r-project.org/")
}

if(!"geneorama" %in% rownames(installed.packages())){
devtools::install_github('geneorama/geneorama')
}

if(!"printr" %in% rownames(installed.packages())){
devtools::install_github(repo = 'yihui/printr')
}

##------------------------------------------------------------------------------
## UPDATE DEPENDENCIES IF OUTDATED
##------------------------------------------------------------------------------

## Update to RSocrata 1.7.2-2 (or later)
## which is only on github as of March 8, 2016
devtools::install_github(repo = 'chicago/RSocrata')
if(packageVersion("RSocrata") < "1.7.2-2"){
install.packages("RSocrata",
repos = "https://cloud.r-project.org/")
}

## Needs recent version for foverlaps
if(packageVersion("data.table") < "1.10.0"){
install.packages("data.table",
repos = "https://cloud.r-project.org/")
}

if(packageVersion("geneorama") < "1.5.0"){
devtools::install_github('geneorama/geneorama')
}
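
Minimum-version checks like these depend on how the versions are compared. Character strings compare lexicographically, which misorders multi-digit components such as 1.9.x versus 1.10.x, whereas version objects compare component by component. The following is a minimal sketch of a version-aware check; `needs_update` is a hypothetical helper introduced for illustration and is not part of the repository.

```r
## Sketch only: `needs_update` is a hypothetical helper, not part of the repo.

## Character comparison is lexicographic, so "1.9.6" sorts after "1.10.0":
string_cmp  <- "1.9.6" < "1.10.0"
## Version objects compare each numeric component in turn:
version_cmp <- numeric_version("1.9.6") < numeric_version("1.10.0")

needs_update <- function(pkg, min_version) {
  ## TRUE when the package is absent or older than min_version
  if (!nzchar(system.file(package = pkg))) {
    return(TRUE)
  }
  utils::packageVersion(pkg) < numeric_version(min_version)
}

## Example: decide whether data.table has to be (re)installed
if (needs_update("data.table", "1.10.0")) {
  message("data.table is missing or older than 1.10.0")
}
```

Folding the "is it installed" test into the same helper means a missing package is treated as outdated, so one call can cover both the install and the update sections of a startup script.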
19 changes: 8 additions & 11 deletions CODE/11_business_download.R
@@ -1,13 +1,10 @@
if(interactive()){
##==========================================================================
## INITIALIZE
##==========================================================================
## Remove all objects; perform garbage collection
rm(list=ls())
gc(reset=TRUE)
## Detach any non-standard libraries
geneorama::detach_nonstandard_packages()
}
##==============================================================================
## INITIALIZE
##==============================================================================
## Remove all objects; perform garbage collection
rm(list=ls())
gc(reset=TRUE)

## Load libraries & project functions
geneorama::loadinstall_libraries(c("data.table", "RSocrata"))
geneorama::sourceDir("CODE/functions/")
@@ -38,4 +35,4 @@ business[ , LICENSE_TERM_START_DATE := as.IDate(LICENSE_TERM_START_DATE, "%m/%d/
business[ , LICENSE_TERM_EXPIRATION_DATE := as.IDate(LICENSE_TERM_EXPIRATION_DATE, "%m/%d/%Y")]

## SAVE RESULT
saveRDS(business, "DATA/bus_license.Rds")
saveRDS(business, "DATA/11_bus_license.Rds")
19 changes: 8 additions & 11 deletions CODE/12_crime_download.R
@@ -1,13 +1,10 @@
if(interactive()){
##==========================================================================
## INITIALIZE
##==========================================================================
## Remove all objects; perform garbage collection
rm(list=ls())
gc(reset=TRUE)
## Detach any non-standard libraries
geneorama::detach_nonstandard_packages()
}
##==============================================================================
## INITIALIZE
##==============================================================================
## Remove all objects; perform garbage collection
rm(list=ls())
gc(reset=TRUE)

## Load libraries & project functions
geneorama::loadinstall_libraries(c("data.table", "RSocrata"))
geneorama::sourceDir("CODE/functions/")
@@ -38,4 +35,4 @@ crime[ , Arrest := as.logical(Arrest)]
crime[ , Domestic := as.logical(Domestic)]

## SAVE RESULT
saveRDS(crime , "DATA/crime.Rds")
saveRDS(crime , "DATA/12_crime.Rds")
21 changes: 9 additions & 12 deletions CODE/13_food_inspection_download.R
@@ -1,13 +1,10 @@
if(interactive()){
##==========================================================================
## INITIALIZE
##==========================================================================
## Remove all objects; perform garbage collection
rm(list=ls())
gc(reset=TRUE)
## Detach any non-standard libraries
geneorama::detach_nonstandard_packages()
}
##==========================================================================
## INITIALIZE
##==========================================================================
## Remove all objects; perform garbage collection
rm(list=ls())
gc(reset=TRUE)

## Load libraries & project functions
geneorama::loadinstall_libraries(c("data.table", "RSocrata"))
geneorama::sourceDir("CODE/functions/")
@@ -34,5 +31,5 @@ setnames(foodInspect, gsub("_+$","",colnames(foodInspect)))
geneorama::convert_datatable_IntNum(foodInspect)
geneorama::convert_datatable_DateIDate(foodInspect)

## SAVE ANSWER
saveRDS(foodInspect , "DATA/food_inspections.Rds")
## SAVE RESULT
saveRDS(foodInspect , "DATA/13_food_inspections.Rds")
20 changes: 9 additions & 11 deletions CODE/14_garbage_download.R
@@ -1,13 +1,10 @@
if(interactive()){
##==========================================================================
## INITIALIZE
##==========================================================================
## Remove all objects; perform garbage collection
rm(list=ls())
gc(reset=TRUE)
## Detach any non-standard libraries
geneorama::detach_nonstandard_packages()
}
##==============================================================================
## INITIALIZE
##==============================================================================
## Remove all objects; perform garbage collection
rm(list=ls())
gc(reset=TRUE)

## Load libraries & project functions
geneorama::loadinstall_libraries(c("data.table", "RSocrata"))
geneorama::sourceDir("CODE/functions/")
@@ -34,4 +31,5 @@ geneorama::convert_datatable_IntNum(garbageCarts)
geneorama::convert_datatable_DateIDate(garbageCarts)

## SAVE RESULT
saveRDS(garbageCarts , "DATA/garbage_carts.Rds")
saveRDS(garbageCarts , "DATA/14_garbage_carts.Rds")

43 changes: 8 additions & 35 deletions CODE/15_sanitation_download.R
@@ -1,13 +1,10 @@
if(interactive()){
##==========================================================================
## INITIALIZE
##==========================================================================
## Remove all objects; perform garbage collection
rm(list=ls())
gc(reset=TRUE)
## Detach any non-standard libraries
geneorama::detach_nonstandard_packages()
}
##==============================================================================
## INITIALIZE
##==============================================================================
## Remove all objects; perform garbage collection
rm(list=ls())
gc(reset=TRUE)

## Load libraries & project functions
geneorama::loadinstall_libraries(c("data.table", "RSocrata"))
geneorama::sourceDir("CODE/functions/")
@@ -37,29 +34,5 @@ geneorama::convert_datatable_IntNum(sanitationComplaints)
geneorama::convert_datatable_DateIDate(sanitationComplaints)

## SAVE RESULT
saveRDS(sanitationComplaints , "DATA/sanitation_code.Rds")

# ## Quick fix to download creation date, which is needed for the heat map calc
# ## The following block can be removed after issue 68 is resolved in RSocrata
# ## https://github.com/Chicago/RSocrata/issues/68
# crdate <- list()
# i <- 0
# while(length(crdate)==0 || length(crdate[[length(crdate)]]) == 50000 ){
# i <- i + 1
# url <- paste0("https://data.cityofchicago.org/resource/me59-5fac.csv",
# "?$select=creation_date&$LIMIT=50000",
# "&$OFFSET=", (i - 1) * 50000)
# crdate[[i]] <- httr::content(httr::GET(url), as = "text")
# crdate[[i]] <- strsplit(crdate[[i]], "\n")[[1]][-2]
# print(i)
# print(length(crdate[[i]]))
# }
# crdate <- do.call(c, crdate)
# crdate <- crdate[-1]
#
# length(crdate) == nrow(sanitationComplaints)
#
# crdate <- as.IDate(crdate, "%m/%d/%Y")
# sanitationComplaints$Creation_Date <- crdate

saveRDS(sanitationComplaints , "DATA/15_sanitation_code.Rds")

24 changes: 16 additions & 8 deletions CODE/21_calculate_violation_matrix.R
@@ -8,25 +8,33 @@
## Remove all objects; perform garbage collection
rm(list=ls())
gc(reset=TRUE)
## Detach libraries that are not used
geneorama::detach_nonstandard_packages()
## Load libraries that are used

## Load libraries & project functions
geneorama::loadinstall_libraries(c("data.table", "MASS"))
## Load custom functions
geneorama::sourceDir("CODE/functions/")

##==============================================================================
## LOAD CACHED RDS FILES
##==============================================================================
foodInspect <- readRDS("DATA/food_inspections.Rds")
foodInspect <- readRDS("DATA/13_food_inspections.Rds")
foodInspect <- filter_foodInspect(foodInspect)

##==============================================================================
## CALCULATE FEATURES BASED ON FOOD INSPECTION DATA
##==============================================================================

## Calculate violation matrix and put into data.table with inspection id as key
vio_mat <- calculate_violation_matrix(foodInspect[ , Violations])

## Add key column to vio_mat
vio_mat <- data.table(vio_mat,
Inspection_ID = foodInspect[ , Inspection_ID],
key = "Inspection_ID")

## calculate_violation_types calculates violations by categories:
## Critical, serious, and minor violations
violation_dat <- calculate_violation_types(foodInspect$Violations,
Inspection_ID = foodInspect$Inspection_ID)
saveRDS(violation_dat, "DATA/violation_dat.Rds")
violation_dat <- calculate_violation_types(violation_mat =vio_mat)

## Save results
saveRDS(vio_mat, "DATA/21_food_inspection_violation_matrix_nums.Rds")
saveRDS(violation_dat, "DATA/21_food_inspection_violation_matrix.Rds")
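
The two-step split (parse the violation text into a matrix, then classify the matrix into counts) can be sketched in base R as follows. This is an illustration under stated assumptions, not the project's implementation: `parse_violations` and `classify_violations` are hypothetical stand-ins for `calculate_violation_matrix` and `calculate_violation_types`, the input format (numbered violations separated by `|`) is assumed, and so are the code ranges used for the critical, serious, and minor groups.

```r
## Sketch only: hypothetical stand-ins for the project functions.
## Assumed input: one string per inspection, with numbered violations
## separated by "|", e.g. "3. SOME TEXT | 33. OTHER TEXT".

## Step 1 (parsing): one row per inspection, one 0/1 column per code
parse_violations <- function(violations, n_codes = 44) {
  mat <- matrix(0L, nrow = length(violations), ncol = n_codes,
                dimnames = list(NULL, as.character(seq_len(n_codes))))
  for (i in seq_along(violations)) {
    if (is.na(violations[i]) || violations[i] == "") next
    parts <- trimws(strsplit(violations[i], "|", fixed = TRUE)[[1]])
    ## Keep the leading number of each part; parts without one are dropped
    codes <- as.integer(regmatches(parts, regexpr("^[0-9]+", parts)))
    codes <- codes[codes >= 1 & codes <= n_codes]
    mat[i, codes] <- 1L
  }
  mat
}

## Step 2 (classification): collapse code columns into category counts.
## The ranges below are an assumption made for this example.
classify_violations <- function(mat) {
  data.frame(criticalCount = rowSums(mat[, 1:14,  drop = FALSE]),
             seriousCount  = rowSums(mat[, 15:29, drop = FALSE]),
             minorCount    = rowSums(mat[, 30:44, drop = FALSE]))
}

violations <- c("3. FOO - Comments: x | 18. BAR | 33. BAZ",
                "",
                "33. BAZ | 34. QUX")
mat    <- parse_violations(violations)
counts <- classify_violations(mat)
```

Keeping the parsed matrix as its own artifact means a change in the inspection format only needs a new parser; the classification step can keep consuming the same matrix shape.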
21 changes: 9 additions & 12 deletions CODE/22_calculate_heat_map_values.R
@@ -1,26 +1,23 @@

##==============================================================================
## INITIALIZE
##==============================================================================
## Remove all objects; perform garbage collection
rm(list=ls())
gc(reset=TRUE)
## Detach libraries that are not used
geneorama::detach_nonstandard_packages()
## Load libraries that are used

## Load libraries & project functions
geneorama::loadinstall_libraries(c("data.table", "MASS"))
## Load custom functions
geneorama::sourceDir("CODE/functions/")

##==============================================================================
## LOAD CACHED RDS FILES
##==============================================================================

## Import the key data sets used for prediction
foodInspect <- readRDS("DATA/food_inspections.Rds")
crime <- readRDS("DATA/crime.Rds")
garbageCarts <- readRDS("DATA/garbage_carts.Rds")
sanitationComplaints <- readRDS("DATA/sanitation_code.Rds")
foodInspect <- readRDS("DATA/13_food_inspections.Rds")
crime <- readRDS("DATA/12_crime.Rds")
garbageCarts <- readRDS("DATA/14_garbage_carts.Rds")
sanitationComplaints <- readRDS("DATA/15_sanitation_code.Rds")

## Apply filters by omitting rows that are not used in the model
foodInspect <- filter_foodInspect(foodInspect)
@@ -58,8 +55,8 @@ sanitationComplaints_heat <-
##==============================================================================
## SAVE HEAT MAP VALUES
##==============================================================================
saveRDS(burglary_heat, "DATA/burglary_heat.Rds")
saveRDS(garbageCarts_heat, "DATA/garbageCarts_heat.Rds")
saveRDS(sanitationComplaints_heat, "DATA/sanitationComplaints_heat.Rds")
saveRDS(burglary_heat, "DATA/22_burglary_heat.Rds")
saveRDS(garbageCarts_heat, "DATA/22_garbageCarts_heat.Rds")
saveRDS(sanitationComplaints_heat, "DATA/22_sanitationComplaints_heat.Rds")


74 changes: 74 additions & 0 deletions CODE/23_food_insp_features.R
@@ -0,0 +1,74 @@
##==============================================================================
## INITIALIZE
##==============================================================================
## Remove all objects; perform garbage collection
rm(list=ls())
gc(reset=TRUE)

## Load libraries & project functions
geneorama::loadinstall_libraries(c("data.table", "MASS"))
geneorama::sourceDir("CODE/functions/")
## Import shift function
shift <- geneorama::shift

##==============================================================================
## LOAD CACHED RDS FILES
##==============================================================================
foodInspect <- readRDS("DATA/13_food_inspections.Rds")

## Apply row filter to remove invalid data
foodInspect <- filter_foodInspect(foodInspect)

## Remove violations from food inspection; violations are captured in the
## violation matrix data
foodInspect$Violations <- NULL

## Import violation matrix which lists violations by categories:
## Critical, serious, and minor violations
violation_dat <- readRDS("DATA/21_food_inspection_violation_matrix.Rds")

##==============================================================================
## CALCULATE FEATURES
##==============================================================================

## Facility_Type_Clean: Anything that is not "restaurant" or "grocery" is "other"
foodInspect[ , Facility_Type_Clean :=
categorize(x = Facility_Type,
primary = list(Restaurant = "restaurant",
Grocery_Store = "grocery"),
ignore.case = TRUE)]
## Join in the violation matrix
foodInspect <- merge(x = foodInspect,
y = violation_dat,
by = "Inspection_ID")
## Create pass / fail flags
foodInspect[ , pass_flag := ifelse(Results=="Pass",1, 0)]
foodInspect[ , fail_flag := ifelse(Results=="Fail",1, 0)]
## Set key to ensure that records are treated CHRONOLOGICALLY...
setkey(foodInspect, License, Inspection_Date)
## Then find previous info by "shifting" the columns (grouped by License)
foodInspect[ , pastFail := shift(fail_flag, -1, 0), by = License]
foodInspect[ , pastCritical := shift(criticalCount, -1, 0), by = License]
foodInspect[ , pastSerious := shift(seriousCount, -1, 0), by = License]
foodInspect[ , pastMinor := shift(minorCount, -1, 0), by = License]

## Calculate time since last inspection.
## If the time is NA, this means it's the first inspection; add an indicator
## variable to flag that it's the first inspection.
foodInspect[i = TRUE ,
j = timeSinceLast := as.numeric(
Inspection_Date - shift(Inspection_Date, -1, NA)) / 365,
by = License]
foodInspect[ , firstRecord := 0]
foodInspect[is.na(timeSinceLast), firstRecord := 1]
foodInspect[is.na(timeSinceLast), timeSinceLast := 2]
foodInspect[ , timeSinceLast := pmin(timeSinceLast, 2)]

##==============================================================================
## SAVE RDS
##==============================================================================
setkey(foodInspect, Inspection_ID)
saveRDS(foodInspect, file.path("DATA/23_food_insp_features.Rds"))
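
The "previous inspection" features above follow a common pattern: sort by license and date, lag each column within its license group, then derive the elapsed time with a cap and a first-record flag. The sketch below reproduces that pattern in base R on toy data. It is an illustration only: `lag1` is a hypothetical helper, the data values are invented, and it does not use `geneorama::shift`, whose argument convention is not shown in this diff.

```r
## Sketch only: base R illustration on toy data; `lag1` is hypothetical.
insp <- data.frame(
  License         = c(1, 1, 1, 2),
  Inspection_Date = as.Date(c("2015-01-01", "2015-07-01",
                              "2018-07-01", "2016-03-01")),
  fail_flag       = c(1, 0, 1, 0)
)

## Sort so that rows within a license are in chronological order
insp <- insp[order(insp$License, insp$Inspection_Date), ]

## Previous value within a group; the first row of a group gets `fill`
lag1 <- function(x, fill) c(fill, x[-length(x)])

## Outcome of the previous inspection for the same license (0 if none)
insp$pastFail <- ave(insp$fail_flag, insp$License,
                     FUN = function(x) lag1(x, 0))

## Years since the previous inspection; NA marks a first inspection
days      <- as.numeric(insp$Inspection_Date)
prev_days <- ave(days, insp$License, FUN = function(x) lag1(x, NA))
insp$timeSinceLast <- (days - prev_days) / 365

## Flag first inspections, then impute and cap the gap at 2 years
insp$firstRecord <- as.integer(is.na(insp$timeSinceLast))
insp$timeSinceLast[is.na(insp$timeSinceLast)] <- 2
insp$timeSinceLast <- pmin(insp$timeSinceLast, 2)
```

Recording `firstRecord` before imputing matters: once the missing gaps are set to 2, the flag is the only way to tell a first inspection from a long gap.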


