## 3. Data Preprocessing
Active incidents were removed: they are noisy duplicates of the "recent" incidents that could not be filtered out during data collection, so they do not contribute to the analysis.
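A minimal sketch of that filtering step on a toy frame (the column name `incident_type` is an assumption for illustration; the notebook's actual column may be named differently):

```python
import pandas as pd

# toy stand-in for pulse_point_df; the real status column name may differ
df = pd.DataFrame({
    "incident_type": ["active", "recent", "recent", "active"],
    "location": ["A", "B", "C", "D"],
})

# drop the noisy "active" duplicates, keeping only the recent incidents
df = df[df["incident_type"] != "active"].reset_index(drop=True)
```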
There are many variations in the location column, such as:
- NE OAK SPRINGS FARM RD, CARLTON, OR
- W 10TH ST, LONG BEACH, CA
- 302 E OJAI AVE, OJAI, CA (OJAI ARCADE (21002302))
- 175 NE 1ST ST, MCMINNVILLE, OR (MCMINNVILLE FIRE DEPARTMENT)
- E BARNETT RD, MEDFORD, OR
We can split the locations into multiple features:

- **State**: the text after the last comma appears to be the short form of a US state or Canadian province.
  - CA -> California
  - OR -> Oregon
- **City**: the text after the second-to-last comma appears to be the city (or town/county) name.
  - MEDFORD is a city in Oregon (last example: "E BARNETT RD, MEDFORD, OR")
- **Address**: apart from the state and city, the remainder is counted as the address when there are three comma-separated elements.
- **Address_2**: apart from the state, city, and address, the remainder is counted as the extended address (address_2) when there are four comma-separated elements.
- **Business**: a bracket-enclosed string is counted as the business name.
  - From the examples above, OJAI ARCADE (21002302) and MCMINNVILLE FIRE DEPARTMENT are counted as business features.
Example 1 (3 elements): `302 E OJAI AVE, OJAI, CA (OJAI ARCADE (21002302))`

- address = `302 E OJAI AVE`
- city = `OJAI`
- state = `CA`
- business = `OJAI ARCADE (21002302)`

Example 2 (4 elements): `GRASSIE BLVD, STE 212, WINNIPEG, MANITOBA`

- address = `GRASSIE BLVD`
- address_2 = `STE 212`
- city = `WINNIPEG`
- state = `MANITOBA` (will be converted to `MB` later)
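The rules above can be sketched as a small standalone parser. This is a hypothetical helper for illustration only, not the notebook's actual pipeline (which extracts the business name and splits the column with pandas below); it assumes every input has three or four comma-separated parts, as in the examples:

```python
def split_location(location):
    """Split a location string into address, address_2, city, state,
    and business according to the comma/bracket rules described above."""
    business = None
    # a trailing "(...)" (possibly with nested brackets) is the business name
    if location.endswith(")"):
        depth, start = 0, None
        for i in range(len(location) - 1, -1, -1):
            if location[i] == ")":
                depth += 1
            elif location[i] == "(":
                depth -= 1
                if depth == 0:
                    start = i
                    break
        if start is not None:
            business = location[start + 1:-1].strip()
            location = location[:start].strip()
    # assumes three or four comma-separated parts, as in the data above
    parts = [p.strip() for p in location.split(",")]
    if len(parts) == 4:
        address, address_2, city, state = parts
    else:
        address, city, state = parts
        address_2 = None
    return {"address": address, "address_2": address_2,
            "city": city, "state": state, "business": business}
```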
```python
def get_business_name(location):
    # https://stackoverflow.com/a/38212061/11105356
    stack = 0
    start_index = None
    results = []
    for i, c in enumerate(location):
        if c == '(':
            if stack == 0:
                start_index = i + 1  # string to extract starts one index later
            # push to stack
            stack += 1
        elif c == ')':
            # pop stack
            stack -= 1
            if stack == 0:
                results.append(location[start_index:i])
    try:
        if len(results) == 0:
            return None
        elif len(results) == 1 and len(results[0]) == 1:
            return None
        elif len(results) == 1 and len(results[0]) != 1:
            return results[0].strip()
        elif len(results) > 1 and len(results[0]) == 1:
            return None
        else:
            return results[1].strip()
    except IndexError:
        return None


# extract the bracket-enclosed string
pulse_point_df['business'] = pulse_point_df.location.apply(get_business_name)

# remove the enclosed business name from the location string
pulse_point_location_data = pulse_point_df.apply(
    lambda row: row['location'].replace(str(row['business']), ''), axis=1)

# remove the leftover brackets from the business replacement
# https://stackoverflow.com/a/49183590/11105356
# remove a (...) substring, with leading whitespace, at the end of the string only
pulse_point_location_data = pulse_point_location_data.str.replace(
    r"\s*\([^()]*\)$", "", regex=True).str.strip()
```
There are far fewer four-element locations than three-element locations.
```python
# split the location (four-element rows)
four_col_location_split = ['address', 'address_2', 'city', 'state']
# four-column indices
# pulse_point_location_data[pulse_point_location_data.str.split(',', expand=True)[3].notna()]
extra_loc_data = pulse_point_location_data.str.split(',', expand=True)
four_col_indices = extra_loc_data[extra_loc_data.apply(lambda x: np.all(pd.notnull(x[3])), axis=1)].index
four_col_loc_df = extra_loc_data.loc[four_col_indices]  # .loc: four_col_indices are labels, not positions
four_col_loc_df.columns = four_col_location_split
pulse_point_df.loc[four_col_loc_df.index, four_col_location_split] = four_col_loc_df
pulse_point_df[four_col_location_split] = pulse_point_df[four_col_location_split].apply(lambda x: x.str.strip())

# split the location (three-element rows)
three_col_location_split = ['address', 'city', 'state']
four_col_loc_df_mask = extra_loc_data.index.isin(four_col_indices)
three_col_loc_df = extra_loc_data[~four_col_loc_df_mask].drop([3], axis=1)
three_col_loc_df.columns = three_col_location_split
# extra_loc_data[~four_col_loc_df_mask][3].notna().sum()  # to check null values
pulse_point_df.loc[three_col_loc_df.index, three_col_location_split] = three_col_loc_df
pulse_point_df[three_col_location_split] = pulse_point_df[three_col_location_split].apply(lambda x: x.str.strip())
```
To check for null or empty city strings:

```python
mask = (pulse_point_df.city.isna()) | (pulse_point_df.city == u'')
pulse_point_df[mask]
```
It turns out that some business names are the same as the city names. Because the business-name text was removed from the location string before the city extraction, the city ends up blank in such cases. Those blank city names can simply be replaced with the business names.
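A sketch of that backfill on toy data, assuming the `city` and `business` columns produced above (the values here are made up for illustration):

```python
import pandas as pd

# toy frame: two rows lost their city because the business name matched it
df = pd.DataFrame({
    "city": ["OJAI", "", None],
    "business": [None, "MCMINNVILLE", "MEDFORD"],
})

# treat empty strings as missing, then fall back to the business name
df["city"] = df["city"].replace("", pd.NA)
df["city"] = df["city"].fillna(df["business"])
```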
There are 47 distinct state values in total; some of them are actually Canadian provinces, e.g. MANITOBA.
```python
# https://www150.statcan.gc.ca/n1/pub/92-195-x/2011001/geo/prov/tbl/tbl8-eng.htm
# https://en.wikipedia.org/wiki/Provinces_and_territories_of_Canada
ca_province_dic = {
    'Newfoundland and Labrador': 'NL',
    'Prince Edward Island': 'PE',
    'Nova Scotia': 'NS',
    'New Brunswick': 'NB',
    'Quebec': 'QC',
    'Ontario': 'ON',
    'Manitoba': 'MB',
    'Saskatchewan': 'SK',
    'Alberta': 'AB',
    'British Columbia': 'BC',
    'Yukon': 'YT',
    'Northwest Territories': 'NT',
    'Nunavut': 'NU',
}

# approach 1
# def handle_state(data_attr):
#     for k, v in ca_province_dic.items():
#         if data_attr.strip().lower() == k.lower():
#             return v
#     return data_attr
# pulse_point_df['state'] = pulse_point_df.state.apply(handle_state)

# approach 2
# https://stackoverflow.com/a/69994272/11105356
ca_province_dict = {k.lower(): v for k, v in ca_province_dic.items()}
pulse_point_df['state'] = pulse_point_df['state'].str.lower().map(ca_province_dict).fillna(pulse_point_df.state)
```
There could be some instances where the state string is longer than 2 characters:

```python
mask = pulse_point_df.state.apply(lambda x: len(x) > 2)
display(pulse_point_df[mask].state)

# keep only the first segment
pulse_point_df.loc[mask, 'state'] = pulse_point_df[mask].state.apply(lambda x: x.split()[0])
```
From time string to seconds: for example, "1 h 34 m" is 94 minutes, i.e. 94 * 60 = 5640 seconds.
```python
import re
from datetime import timedelta

# https://stackoverflow.com/a/57846984/11105356
UNITS = {'s': 'seconds', 'm': 'minutes', 'h': 'hours', 'd': 'days', 'w': 'weeks'}

# chance of having days and weeks is none
def convert_to_seconds(s):
    s = s.replace(" ", "")
    return int(timedelta(**{
        UNITS.get(m.group('unit').lower(), 'seconds'): int(m.group('val'))
        for m in re.finditer(r'(?P<val>\d+)(?P<unit>[smhdw]?)', s, flags=re.I)
    }).total_seconds())

# convert_to_seconds("1 h 34 m")  # 5640

# extract the total duration from the "duration" column text
pulse_point_df["duration_in_seconds"] = pulse_point_df.duration.apply(convert_to_seconds)
```
```python
pulse_point_df["day_name"] = pulse_point_df.date_of_incident.dt.day_name()
pulse_point_df["weekday"] = pulse_point_df.date_of_incident.dt.weekday
pulse_point_df["month_name"] = pulse_point_df.date_of_incident.dt.month_name()
```
## more features
```python
pulse_point_df.date_of_incident.dt.month_name()
pulse_point_df.date_of_incident.dt.month
pulse_point_df.date_of_incident.dt.day
pulse_point_df.date_of_incident.dt.day_name()
pulse_point_df.date_of_incident.dt.weekday
pulse_point_df.date_of_incident.dt.isocalendar().week
```
I assigned the time-of-day values based on the ranges below:

| Time of the Day | Range |
|---|---|
| Morning | 5 AM to 11:59 AM |
| Afternoon | 12 PM to 4:59 PM |
| Evening | 5 PM to 8:59 PM |
| Night | 9 PM to 11:59 PM |
| Midnight | 12 AM to 4:59 AM |
```python
from datetime import datetime

# https://stackoverflow.com/a/70018607/11105356
def time_range(time):
    hour = datetime.strptime(time, '%I:%M %p').hour
    if hour > 20:
        return "Night"
    elif hour > 16:
        return "Evening"
    elif hour > 11:
        return "Afternoon"
    elif hour > 4:
        return "Morning"
    else:
        return "Midnight"

pulse_point_df["time_of_the_day"] = pulse_point_df.timestamp_time.apply(time_range)
```