4.EDA
A quick overview of the preprocessed data:
The preprocessed dataset contains 5 additional columns extracted from the location column and another 5 columns extracted from the date_of_incident and duration columns. The Id, Incident_logo, and agency_logo columns from the original dataset were discarded.
Columns | Description | Data Type |
---|---|---|
business | Name of the business extracted from location (e.g., JANIE & JACK, DOLLAR GENERAL) | object |
address | Address where the incident took place (extracted from location) | object |
address_2 | Extended address where the incident took place (extracted from location) | object |
city | City where the incident occurred (extracted from location). It could also be a town or a country | object |
state | State where the incident took place (extracted from location) | object |
duration_in_seconds | Incident duration in seconds (extracted from duration) | numeric, int |
day_name | Name of the day when the incident took place | object |
weekday | The day of the week with Monday=0, Sunday=6. | object |
month_name | Name of the month (extracted from date) | object |
time_of_the_day | morning (5AM-11:59AM), afternoon (12PM-4:59 PM), evening (5PM-8:59PM), night (9PM-11:59PM), midnight (12AM-4:59AM) | object |
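The time_of_the_day bins in the table can be derived from the hour of the incident timestamp. The snippet below is a minimal sketch of that mapping, not the project's actual preprocessing code; the function name `time_of_day` is hypothetical.

```python
from datetime import datetime


def time_of_day(ts: datetime) -> str:
    """Map a timestamp to the time_of_the_day bins listed in the table."""
    hour = ts.hour
    if hour < 5:
        return "midnight"   # 12AM-4:59AM
    if hour < 12:
        return "morning"    # 5AM-11:59AM
    if hour < 17:
        return "afternoon"  # 12PM-4:59PM
    if hour < 21:
        return "evening"    # 5PM-8:59PM
    return "night"          # 9PM-11:59PM


print(time_of_day(datetime(2021, 5, 3, 4, 59)))
```

With pandas, the same function could be applied to a datetime column, for example `df["date_of_incident"].apply(time_of_day)`, assuming that column has already been parsed to datetimes.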
- To learn more about incident types, visit https://www.pulsepoint.org/incident-types
- The Description values (unit codes) are radio IDs of agency units, such as engines, chief officers, and ambulances, assigned to the incident. To learn more about unit codes, visit https://www.pulsepoint.org/unit-status-legend
The codes themselves are defined by each agency and are typically followed by a number that identifies a particular instance of each asset type. A legend is sometimes provided on the agency information page; some common examples follow:
- B=Battalion
- BC=Battalion Chief
- E=Engine
- CMD=Command
- CPT=Helicopter
- C=Crew
- DZR=Dozer
- HM=Hazmat
- ME=Medic Engine
- MRE=Medic Rescue Engine
- P=Patrol
- R=Rescue
- RE=Rescue Engine
- SQ=Squad
- T=Truck
- U=Utility
- WT=Water Tender
Credit: PulsePoint Wikipedia
Note: There is no standard for the identifier abbreviations (E, T, S, BC, RA, PM, etc.), and they can vary significantly from agency to agency.
Example - Ventura County Fire Department PulsePoint Unit Abbreviations PDF
To know more, visit - https://www.pulsepoint.org/unit-status-legend
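Since a unit code is typically a letter prefix followed by a number, it can be split with a small regular expression. The following is only a sketch under that assumption: `parse_unit` and `UNIT_LEGEND` are hypothetical names, the legend holds just a few of the common prefixes, and real agencies may use codes that do not fit this pattern.

```python
import re

# A few of the common prefixes; agencies define their own codes,
# so this mapping is illustrative only.
UNIT_LEGEND = {
    "B": "Battalion",
    "BC": "Battalion Chief",
    "E": "Engine",
    "ME": "Medic Engine",
    "T": "Truck",
    "WT": "Water Tender",
}


def parse_unit(code):
    """Split a unit ID such as 'E12' into (prefix, number, description)."""
    match = re.fullmatch(r"([A-Za-z]+)(\d*)", code.strip())
    if match is None:
        # The code does not follow the letters-then-digits pattern.
        return None, None, None
    prefix, number = match.group(1).upper(), match.group(2)
    return prefix, (int(number) if number else None), UNIT_LEGEND.get(prefix)


print(parse_unit("E12"))
print(parse_unit("BC3"))
```

Prefixes missing from the legend come back with a description of `None`, which makes agency-specific codes easy to spot and review.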
Issues -
- Some cities in different states have the same name. Example - BLOOMINGTON in CA or IN.
- Some city names also match places in other countries. Examples -
  - NAPLES - Italy
  - Columbia - Country in South America
  - Suffolk - UK
  - STAFFORD - UK
  - NORFOLK - UK
Adding the state and country names to the city name helps the geocoder return the appropriate location.
from geopy.geocoders import Nominatim  # forward geocoding (address -> coordinates)
geolocator = Nominatim(user_agent='myapplication')
def get_nominatim_geocode(address):
    try:
        location = geolocator.geocode(address)
        # geocode() returns None when nothing is found; the resulting
        # AttributeError is caught below
        return location.raw['lon'], location.raw['lat']
    except Exception as e:
        # print(e)
        return None, None
# alternative way: querying the Nominatim web endpoint directly
# def get_nominatim_geocode(address):
#     url = 'https://nominatim.openstreetmap.org/search/' + urllib.parse.quote(address) + '?format=json'
#     try:
#         response = requests.get(url).json()
#         return response[0]["lon"], response[0]["lat"]
#     except Exception as e:
#         # print(e)
#         return None, None
import urllib.parse
import requests
def get_positionstack_geocode(address):
    BASE_URL = "http://api.positionstack.com/v1/forward?access_key="
    API_KEY = API_KEY_POSITIONSTACK  # must be defined beforehand
    url = BASE_URL + API_KEY + '&query=' + urllib.parse.quote(address)
    try:
        response = requests.get(url).json()
        # print(response["data"][0])
        return response["data"][0]["longitude"], response["data"][0]["latitude"]
    except Exception as e:
        # print(e)
        return None, None
def get_geocode(address):
    # try Nominatim first and fall back to positionstack
    long, lat = get_nominatim_geocode(address)
    if long is None:
        return get_positionstack_geocode(address)
    else:
        return long, lat
# example
address = "50TH ST S"
get_geocode(address)
from tqdm.auto import tqdm # for notebooks
# Create new `pandas` methods which use `tqdm` progress
# (can use tqdm_gui, optional kwargs, etc.)
tqdm.pandas() # https://stackoverflow.com/a/34365537/11105356
# for Canadian provinces
ca_province_dic = {
    'Newfoundland and Labrador': 'NL',
    'Prince Edward Island': 'PE',
    'Nova Scotia': 'NS',
    'New Brunswick': 'NB',
    'Quebec': 'QC',
    'Ontario': 'ON',
    'Manitoba': 'MB',
    'Saskatchewan': 'SK',
    'Alberta': 'AB',
    'British Columbia': 'BC',
    'Yukon': 'YT',
    'Northwest Territories': 'NT',
    'Nunavut': 'NU',
}
canada_mask = pulse_point_city_df.state.isin([*ca_province_dic.values()])
pulse_point_city_df['location'] = pulse_point_city_df['city'] + ', ' + pulse_point_city_df['state']
# assign through a single .loc call to avoid chained assignment
pulse_point_city_df.loc[canada_mask, 'location'] = pulse_point_city_df.loc[canada_mask, 'location'] + ', CANADA'
pulse_point_city_df.loc[~canada_mask, 'location'] = pulse_point_city_df.loc[~canada_mask, 'location'] + ', USA'
# to verify
# pulse_point_city_df[pulse_point_city_df['location'].str.endswith('USA')]
# pulse_point_city_df[pulse_point_city_df['location'].str.endswith('CANADA')]
%%time
# fetch geolocation (the %%time cell magic must be the first line of the cell)
location_df = pulse_point_city_df.location.progress_apply(lambda x:get_geocode(str(x.strip()))).apply(pd.Series)
location_df.columns = ['longitude', 'latitude']
pulse_point_city_df = pulse_point_city_df.join(location_df) # pulse_point_city_df will be used later
Top 5 Cities by agency engagement -
Name | Count | State |
---|---|---|
1. LOS ANGELES | 7449 | CA |
2. MILWAUKEE | 4404 | WI |
3. COLUMBUS | 4115 | OH |
4. CLEVELAND | 3977 | OH |
5. ROCKFORD | 2950 | IL |
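A ranking like the one above can be produced by counting incidents per city. The following is a minimal sketch on toy data, assuming one row per incident with `city` and `state` columns; the `incidents` frame is a made-up stand-in for the preprocessed dataset, not the project's code.

```python
import pandas as pd

# Toy stand-in for the preprocessed incidents; one row per incident.
incidents = pd.DataFrame({
    "city": ["LOS ANGELES", "LOS ANGELES", "COLUMBUS", "BLOOMINGTON", "BLOOMINGTON"],
    "state": ["CA", "CA", "OH", "CA", "IN"],
})

# Group on city AND state so that same-named cities in different states stay separate.
city_counts = (
    incidents.groupby(["city", "state"])
    .size()
    .reset_index(name="count")
    .sort_values("count", ascending=False, kind="stable")
    .reset_index(drop=True)
)
print(city_counts.head(5))
```

Grouping on the city name alone would merge BLOOMINGTON, CA with BLOOMINGTON, IN, which is the ambiguity described in the Issues list.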
import folium
import geopandas
from folium.plugins import HeatMap
geometry = geopandas.points_from_xy(pulse_point_city_df.longitude, pulse_point_city_df.latitude)
geo_df = geopandas.GeoDataFrame(pulse_point_city_df[['city','count','longitude', 'latitude']], geometry=geometry)
# named heat_map to avoid shadowing the built-in map()
heat_map = folium.Map(location=[48, -102], tiles='Cartodb dark_matter', zoom_start=4)
heat_data = [[point.xy[1][0], point.xy[0][0]] for point in geo_df.geometry]  # [latitude, longitude] pairs
HeatMap(heat_data).add_to(heat_map)
heat_map
import folium
import geopandas
from folium.plugins import HeatMap
# to avoid recursion depth issue change latitude,longitude type to float
# https://github.com/python-visualization/folium/issues/1105
pulse_point_city_df['latitude'] = pulse_point_city_df['latitude'].astype(float)
pulse_point_city_df['longitude'] = pulse_point_city_df['longitude'].astype(float)
map_USA = folium.Map(location=[48, -102],
                     zoom_start=4,
                     prefer_canvas=True,
                     )
occurrences = folium.map.FeatureGroup()
n_mean = pulse_point_city_df['count'].mean()
for lat, lng, number, city in zip(pulse_point_city_df['latitude'],
                                  pulse_point_city_df['longitude'],
                                  pulse_point_city_df['count'],
                                  pulse_point_city_df['city']):
    occurrences.add_child(
        folium.vector_layers.CircleMarker(
            [lat, lng],
            radius=number/(n_mean/3),  # radius scaled by the number of occurrences
            color='yellow',
            fill=True,
            fill_color='blue',
            fill_opacity=0.4,
            # tooltip = city
            tooltip=str(number)+','+str(city)[:21],  # at most 21 characters can be displayed
            # most of the city names contain 5-20 characters
            # check pulse_point_city_df.city.apply(len).plot();
            # get more from tooltip https://github.com/python-visualization/folium/issues/1010#issuecomment-435968337
        )
    )
map_USA.add_child(occurrences)
Top 5 States by agency engagement -
Name | Count | Abbreviation |
---|---|---|
1. California | 70989 | CA |
2. Florida | 23213 | FL |
3. Virginia | 16016 | VA |
4. Washington | 15532 | WA |
5. Ohio | 14440 | OH |
Animated geo-scatter plot
import folium
import pandas as pd
import numpy as np
import pdpipe as pdp
import plotly.express as px
df_state_incident = (
    pulse_point_df.groupby(["date_of_incident", "state"], as_index=False)
    .count()[['date_of_incident', 'state', 'title']]
    .reset_index(drop=True)
    .rename(columns={'date_of_incident': 'date', 'title': 'count'})
)
df_state_incident.columns = ['date', 'state', 'count']
# set the size of the geo bubble
def set_size(value):
    '''
    Takes the numeric value of a parameter to visualize on a map (Plotly Geo-Scatter plot)
    Returns a number to indicate the size of a bubble for a country which numeric attribute value
    was supplied as an input
    '''
    result = np.log(1+value)
    if result < 0:
        result = 0.1
    return result
pipeline = pdp.PdPipeline([
    pdp.ApplyByCols('count', set_size, 'size', drop=False),
])
agg_incident_data = pipeline.apply(df_state_incident)
agg_incident_data.fillna(0, inplace=True)
agg_incident_data = agg_incident_data.sort_values(by='date', ascending=True)
agg_incident_data.date = agg_incident_data.date.dt.strftime('%Y-%m-%d')  # convert to string object
fig = px.scatter_geo(
    agg_incident_data, locations="state", locationmode='USA-states',
    scope="usa",
    color="count",
    size='size', hover_name="state",
    range_color=[0, 2000],
    projection="albers usa", animation_frame="date",
    title='PulsePoint Incidents: Local Emergencies By State',
    color_continuous_scale="portland"
)
fig.show()
# https://developers.google.com/public-data/docs/canonical/states_csv
state_coordinate = pd.read_html("https://developers.google.com/public-data/docs/canonical/states_csv")[0]
# US States with Total Incident Count
pulse_point_state_df = pulse_point_df.groupby(['state']).count()[['title']].reset_index().rename(columns={'title':'count'})
# Missing US States
state_coordinate[~state_coordinate.state.isin(pulse_point_state_df.state)].reset_index(drop=True)
# Filter US States
pulse_point_state_df = pulse_point_state_df.merge(state_coordinate, on='state', how='left')
# drop Canadian provinces
pulse_point_state_df.dropna(inplace=True)
url = (
    "https://raw.githubusercontent.com/python-visualization/folium/master/examples/data"
)
state_geo = f"{url}/us-states.json"
state_data = pulse_point_state_df.iloc[:, [0, 1]]  # state abbreviation and incident count
m = folium.Map(location=[48, -102], zoom_start=4)
folium.Choropleth(
    geo_data=state_geo,
    name="choropleth",
    data=state_data,
    columns=["state", "count"],
    key_on="feature.id",
    fill_color="YlGn",
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name="Number of Incidents",
).add_to(m)
folium.LayerControl().add_to(m)
m
# icon credit : https://icon-icons.com/icon/location-sos-phone-call-help/68848
# https://www.clipartmax.com/middle/m2H7i8G6N4H7b1N4_metallic-icon-royalty-free-cliparts-icone-sos-png/
# custom icon : https://stackoverflow.com/a/68992396/11105356
import folium
for i in range(0, len(pulse_point_state_df)):
    folium.Marker(
        location=[pulse_point_state_df.iloc[i]['latitude'], pulse_point_state_df.iloc[i]['longitude']],
        popup=folium.Popup(f"{pulse_point_state_df.iloc[i]['name']}\n{pulse_point_state_df.iloc[i]['count']}", parse_html=True),
        icon=folium.features.CustomIcon('https://i.postimg.cc/JhmnMQXj/sos.png', icon_size=(24, 31))
    ).add_to(m)
m
# https://plotly.com/python/choropleth-maps
import plotly.graph_objects as go
fig = go.Figure(data=go.Choropleth(
    locations=pulse_point_state_df['state'],  # Spatial coordinates
    z=pulse_point_state_df['count'].astype(float),  # Data to be color-coded
    locationmode='USA-states',  # set of locations that match entries in `locations`
    colorscale='Reds',
    colorbar_title="Total Occurrences",
))
fig.update_layout(
    title_text='US PulsePoint Emergency Occurrences by State',
    geo_scope='usa',  # limit map scope to USA
)
fig.show()
Top ten emergencies during 'Midnight' or 'Morning' -
Midnight:
- Medical Emergency
- Traffic Collision
- Fire Alarm
- Alarm
- Public Service
- Structure Fire
- Refuse/Garbage Fire
- Mutual Aid
- Residential Fire
- Expanded Traffic Collision
Morning:
- Medical Emergency
- Traffic Collision
- Fire Alarm
- Public Service
- Refuse/Garbage Fire
- Structure Fire
- Fire
- Residential Fire
- Mutual Aid
- Lift Assist
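Rankings like the two lists above can be obtained by filtering on time_of_the_day and counting incident titles. The following is a minimal sketch on toy data, assuming the incident type is stored in a `title` column; the helper name `top_incidents` and the sample rows are hypothetical.

```python
import pandas as pd

# Toy stand-in for the preprocessed incidents.
incidents = pd.DataFrame({
    "title": ["Medical Emergency", "Medical Emergency", "Traffic Collision",
              "Fire Alarm", "Medical Emergency", "Traffic Collision"],
    "time_of_the_day": ["midnight", "midnight", "midnight",
                        "morning", "morning", "morning"],
})


def top_incidents(df, period, n=10):
    """Return the n most frequent incident titles for one time-of-day bin."""
    subset = df.loc[df["time_of_the_day"] == period, "title"]
    return subset.value_counts().head(n)


print(top_incidents(incidents, "midnight"))
```

`value_counts` sorts in descending order of frequency, so `head(10)` on the full dataset would give a top-ten list for the chosen period.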
- Most of the incidents occurred in California
- Most incidents happened during midnight and in the morning throughout the week
- Most of the emergency engagement lasted under 30 mins
- The highest number of incidents happened on Sunday
- The number of incidents increased after the Covid-19 lockdown
- Medical emergency was the most frequent incident type, followed by traffic collision and fire alarm
- Montgomery County, Milwaukee Fire, and Columbus Fire were the most active agencies during the five-month period
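A finding such as the share of engagements under 30 minutes can be checked directly against duration_in_seconds. The following is a minimal sketch with made-up durations, not the dataset's real values.

```python
import pandas as pd

# Made-up durations in seconds, standing in for the duration_in_seconds column.
durations = pd.Series([300, 900, 1500, 2400, 7200], name="duration_in_seconds")

# The mean of a boolean mask is the fraction of True values.
share_under_30_min = (durations < 30 * 60).mean()
print(f"{share_under_30_min:.0%} of incidents lasted under 30 minutes")
```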