Skip to content

3.Data Preprocessing

Ahmed Shahriar Sakib edited this page Dec 30, 2021 · 1 revision

Data Cleaning

Active incidents were removed because active incidents are the noisy duplicated data of the "recent type incident" which was unable to remove during the data collection process. Thus, it does not contribute to the analysis.

Feature Extraction

Location Column

There are many variations in the location column:

Such as -

  • NE OAK SPRINGS FARM RD, CARLTON, OR
  • W 10TH ST, LONG BEACH, CA
  • 302 E OJAI AVE, OJAI, CA (OJAI ARCADE (21002302))
  • 175 NE 1ST ST, MCMINNVILLE, OR (MCMINNVILLE FIRE DEPARTMENT)
  • E BARNETT RD, MEDFORD, OR

We can split the locations into multiple features -

State Text after the last comma appears to be the short form of US states or Canadian provinces.

CA -> California state

OR -> Oregon state

City Text after second last comma appears to be city name (or town, county name)

MEDFORD is a city in Oregon (last example - "E BARNETT RD, MEDFORD, OR")

Address Apart from state and city name, the rest will be counted as address features if there are three comma-separated elements (texts)

Address_2 Apart from state, city, and address, the rest will be counted as extended address (address_2) feature if there are four comma-separated element/string

Business Bracket enclosed string will be counted as Business Name.

From the above example - OJAI ARCADE (21002302) and MCMINNVILLE FIRE DEPARTMENT are counted as business feature

Split Location

Example 1 (3 elements): 302 E OJAI AVE, OJAI, CA (OJAI ARCADE (21002302))

address = 302 E OJAI AVE, city = OJAI, state = CA, business = OJAI ARCADE (21002302)

Example 2 (4 elements): GRASSIE BLVD, STE 212, WINNIPEG, MANITOBA

address = GRASSIE BLVD, address_2 = STE 212, city = WINNIPEG, state = MANITOBA (wil be converted to MB later)

Utility Script

Extract Business Name

def get_business_name(location):
    # https://stackoverflow.com/a/38212061/11105356
    stack = 0
    start_index = None
    results = []

    for i, c in enumerate(location):
        if c == '(':
            if stack == 0:
                start_index = i + 1  # string to extract starts one index later

            # push to stack
            stack += 1
        elif c == ')':
            # pop stack
            stack -= 1

            if stack == 0:
                results.append(location[start_index:i])
    try:
      if len(results) == 0:
        return None
      elif len(results) == 1 and len(results[0]) == 1:
        return None
      elif len(results) == 1 and len(results[0])!=1:
        return results[0].strip()
      elif len(results) > 1 and len(results[0])==1:
        return None
      else:
        return results[1].strip()
    except IndexError as ie:
      pass
# to extract bracket enclosed string 
pulse_point_df['business'] = pulse_point_df.location.apply(lambda x : get_business_name(x))

# remove enclosed business name from the location string
pulse_point_location_data = pulse_point_df.apply(lambda row : row['location'].replace(str(row['business']), ''), axis=1)

# remove leftover bracket from the business replacemnt
# https://stackoverflow.com/a/49183590/11105356
# remove a (...) substring with a leading whitespace at the end of the string only
pulse_point_location_data = pulse_point_location_data.str.replace(r"\s*\([^()]*\)$","").str.strip()

There are very few numbers of four feature location than three feature location.

Four Column Location

# split the location
four_col_location_split = ['address', 'address_2', 'city','state']

# four col indices
# pulse_point_location_data[pulse_point_location_data.str.split(',', expand=True)[3].notna()]

extra_loc_data = pulse_point_location_data.str.split(',', expand=True)
four_col_indices = extra_loc_data[extra_loc_data.apply(lambda x: np.all(pd.notnull(x[3])) , axis = 1)].index
four_col_loc_df = extra_loc_data.iloc[four_col_indices]
four_col_loc_df.columns = four_col_location_split

pulse_point_df.loc[four_col_loc_df.index , four_col_location_split] = four_col_loc_df
pulse_point_df[four_col_location_split] = pulse_point_df[four_col_location_split].apply(lambda x: x.str.strip())

Three Column Location

# split the location
three_col_location_split = ['address', 'city','state']

four_col_loc_df_mask = extra_loc_data.index.isin(four_col_indices)
three_col_loc_df = extra_loc_data[~four_col_loc_df_mask].drop([3], axis=1)
three_col_loc_df.columns = three_col_location_split

# extra_loc_data[~three_col_loc_df][3].notna().sum() # to check null values

pulse_point_df.loc[three_col_loc_df.index , three_col_location_split] = three_col_loc_df
pulse_point_df[three_col_location_split] = pulse_point_df[three_col_location_split].apply(lambda x: x.str.strip())

City

To check null string -

mask = ((pulse_point_df.city.isna()) | (pulse_point_df.city==u'') )
pulse_point_df[mask]

It has been found that some of the business names are the same as the city names. I first removed the text containing business names and then performed text extraction for cities. That's why city names are blank for the cases like these. Those city names could be easily replaced with business names.

State

Total 47 States. Some of them are Canadian provinces, ex - MANITOBA

Canadian Province Mapping

# https://www150.statcan.gc.ca/n1/pub/92-195-x/2011001/geo/prov/tbl/tbl8-eng.htm
# https://en.wikipedia.org/wiki/Provinces_and_territories_of_Canada

ca_province_dic = {
    'Newfoundland and Labrador': 'NL',
    'Prince Edward Island': 'PE',
    'Nova Scotia': 'NS',
    'New Brunswick': 'NB',
    'Quebec': 'QC',
    'Ontario': 'ON',
    'Manitoba': 'MB',
    'Saskatchewan': 'SK',
    'Alberta': 'AB',
    'British Columbia': 'BC',
    'Yukon': 'YT',
    'Northwest Territories': 'NT',
    'Nunavut': 'NU',
}

# approach 1

# def handle_state(data_attr):
#   for k, v in canada_provinces_dic.items():
#       if data_attr.strip().lower() == k.lower():
#         return canada_provinces_dic[k]
#   else:
#     return data_attr

# pulse_point_df['state'] =  pulse_point_df.state.apply(handle_state)


# approach 2

# https://stackoverflow.com/a/69994272/11105356

ca_province_dict = {k.lower():v for k,v in ca_province_dic.items()}
pulse_point_df['state']  = pulse_point_df['state'].str.lower().map(ca_province_dict).fillna(pulse_point_df.state)

Other Noises

There could be some instances where the character length of the state is more than 2

mask = pulse_point_df.state.apply(lambda x:len(x)>2)
display(pulse_point_df[mask].state)

# keeping only the first segment
pulse_point_df[mask].state.apply(lambda x: x.split()[0])

Time

Conversion

From time string to seconds

For example - "1 h 34 m" will be 94*60 = 5640 seconds

#https://stackoverflow.com/a/57846984/11105356

UNITS = {'s':'seconds', 'm':'minutes', 'h':'hours', 'd':'days', 'w':'weeks'}

# chance of having days and weeks is none 

def convert_to_seconds(s):
    s = s.replace(" ","")
    return int(timedelta(**{
        UNITS.get(m.group('unit').lower(), 'seconds'): int(m.group('val'))
        for m in re.finditer(r'(?P<val>\d+)(?P<unit>[smhdw]?)', s, flags=re.I)
    }).total_seconds())

# convert_to_seconds("1 h 34 m")

# extract duration total time from “duration” column text
pulse_point_df["duration_in_seconds"] = pulse_point_df.duration.apply(lambda x:convert_to_seconds(x))

New Features

pulse_point_df["day_name"], pulse_point_df["weekday"] = pulse_point_df.date_of_incident.dt.day_name(), pulse_point_df.date_of_incident.dt.weekday

pulse_point_df["month_name"] = pulse_point_df.date_of_incident.dt.month_name()


## more features

pulse_point_df.date_of_incident.dt.month_name()
pulse_point_df.date_of_incident.dt.month

pulse_point_df.date_of_incident.dt.day
pulse_point_df.date_of_incident.dt.day_name()

pulse_point_df.date_of_incident.dt.weekday
pulse_point_df.date_of_incident.dt.isocalendar().week

Time of the day

I assigned Daytime values based on the time range below -

Time of the Day Range
Morning 5 AM to 11:59 AM
Afternoon 12PM to 4:59 PM
Evening 5 PM to 8:59 PM
Night 9 PM to 11:59 PM
Midnight 12 AM to 4:59 AM
# https://stackoverflow.com/a/70018607/11105356

def time_range(time):
  hour = datetime.strptime(time, '%I:%M %p').hour
  if hour > 20:
      return "Night"
  elif hour > 16:
      return "Evening"
  elif hour > 11:
      return "Afternoon"
  elif hour > 4:
      return "Morning"
  else:
      return "Midnight"

pulse_point_df["time_of_the_day"] = pulse_point_df.timestamp_time.apply(lambda time: time_range(time))