Car Crash Prediction in NZ - Machine Learning Pipeline¶

1. Introduction¶

The dataset we use for this post is New Zealand Crash Analysis Dataset which is updated on a quarterly basis by the Transport Agency. The dataset was last updated on October 2018 (from January 2000). It contains all traffic crashes as reported to the Transport Agency by the NZ police. However, not all crashes are reported NZ police. A big portion of minor car crashes are settled on site by the parties without reporting to the police. The level of reporting increases with the severity of the crash. Due to the nature of non-fatal crashes it is believed that these are under-reported.

The dataset is available in many different formats and APIs. It is easy to grab them through API interfaces, instead of downloading to your local machine. In this way, we can have the newest version of the dataset every time we run it. If you want to download the dataset to your local machine, it can be found here. The crashes shown in the dataset are strongly related to geography, so we will use the geojson format. So, we can also perform geographic data analysis without creating geometries from latitude and longtitude and deal with coordinate reference systems and projections.

And we will also use different machine learning models to predict future car crashes based on the available data. Now, let's get into it.

This post is a self-learning project, and the original content can be found here

2. Exploratory Data Analysis (EDA)¶

Import Libraries

In [1]:

# Import libraries
import pandas as pd
import numpy as np
import geopandas as gpd
import matplotlib.pyplot as plt
import folium
import requests
import seaborn as sns

Import Data

Import the dataset through API. This process can take a while.

In [2]:

url = 'https://opendata.arcgis.com/datasets/a163c5addf2c4b7f9079f08751bd2e1a_0.geojson'
geojson = requests.get(url).json()
crs = {'init': 'epsg:3851'} #Coordinate reference system for New Zealand
gdf = gpd.GeoDataFrame.from_features(geojson['features'], crs = crs)
print(gdf.shape)
gdf.head()

(665847, 88)

Out[2]:

	OBJECTID	Pedestrian	areaUnitID	carStationWagon	cliffBank	...	urban	vanOrUtility	weatherA	weatherB
0	3001	1	514101	1	0	...	Urban	0	Fine	Unknown
1	3002	1	514102	0	0	...	Urban	0	Fine	Unknown
2	3003	1	521802	1	0	...	Urban	0	Fine	Unknown
3	3004	0	508801	0	1	...	Urban	1	Fine	Unknown
4	3005	1	508411	1	0	...	Urban	0	Fine	Unknown

5 rows × 88 columns

In [3]:

gdf.columns

Out[3]:

Index(['OBJECTID', 'Pedestrian', 'advisorySpeed', 'animals', 'areaUnitID',
       'bicycle', 'bridge', 'bus', 'carStationWagon', 'cliffBank',
       'cornerRoadSideRoad', 'crashDirectionDescription', 'crashDistance',
       'crashFinancialYear', 'crashLocation1', 'crashLocation2',
       'crashRPDirectionDescription', 'crashRPDisplacement',
       'crashRPNewsDescription', 'crashRPRS', 'crashRPSH',
       'crashSHDescription', 'crashSeverity', 'crashYear', 'darkLight',
       'debris', 'directionRoleDescription', 'ditch', 'easting', 'fatalCount',
       'fence', 'flatHill', 'geometry', 'guardRail', 'holiday',
       'houseBuilding', 'intersection', 'intersectionMidblock', 'junctionType',
       'kerb', 'light', 'meshblockID', 'minorInjuryCount', 'moped',
       'motorcycle', 'multiVehicle', 'northing', 'numberOfLanes',
       'objectThrownOrDropped', 'other', 'otherVehicleType',
       'outdatedLocationDescription', 'overBank', 'parkedVehicle',
       'phoneBoxEtc', 'postOrPole', 'regionDesc', 'roadCharacter',
       'roadCurvature', 'roadLane', 'roadMarkings', 'roadSurface', 'roadWet',
       'roadworks', 'schoolBus', 'seriousInjuryCount', 'slipFlood',
       'speedLimit', 'strayAnimal', 'streetLight', 'suv', 'taxi',
       'temporarySpeedLimit', 'tlaID', 'tlaName', 'trafficControl',
       'trafficIsland', 'trafficSign', 'train', 'tree', 'truck',
       'unknownVehicleType', 'urban', 'vanOrUtility', 'vehicle', 'waterRiver',
       'weatherA', 'weatherB'],
      dtype='object')

Important Crash Information¶

Crash information on death and injuries are very important to us. We will first explore the dataset on this part of information.

crashSeverity: the severity of a car crash. Possible values are 'F'-fatal, 'S'-serious, 'M'-minor, 'N'-non-injury. This is determined by the worst injury sustained in the crash at the time of entry.

seriousInjuryCount: the number of serious injuries associated with a crash.

minorInjuryCount: the number of minor injuries associated with a crash.

fatalCount: the number of fatal casualties associated with a crash.

In [4]:

start_year = gdf.crashYear.min()
end_year = gdf.crashYear.max()

fatal = gdf['fatalCount'].sum()
serious = gdf['seriousInjuryCount'].sum()
minor = gdf['minorInjuryCount'].sum()

print("The total death in car crash accident in New Zealand from {} to {} is {}.".format(start_year,
                                                                                       end_year, fatal))
print("While the total number of serious injuries and minor injuries in car crashes is {}, and {}, respectively.".format(
                                    serious, minor))

The total death in car crash accident in New Zealand from 2000 to 2018 is 6922.
While the total number of serious injuries and minor injuries in car crashes is 45044, and 205895, respectively.

In [5]:

import warnings
warnings.simplefilter('ignore')

fig, ax = plt.subplots(1, 3, figsize = (25, 5))
sns.barplot(x = 'crashYear', y = 'fatalCount', data = gdf, ax = ax[0], color = 'k')
ax[0].set_title("The number of total death from year 2000 to 2018", fontsize = 14)

sns.barplot(x = 'crashYear', y = 'seriousInjuryCount', data = gdf, ax = ax[1], color = 'r')
ax[1].set_title("The number of total serious injuries from year 2000 to 2018", fontsize = 14)

sns.barplot(x = 'crashYear', y = 'minorInjuryCount', data = gdf, ax = ax[2], color = 'g')
ax[2].set_title("The number of total minor injuries from year 2000 to 2018", fontsize = 14)

[ax[i].set_xlabel('Crash Year') for i in range(3)]
plt.tight_layout()
plt.show()

In [6]:

fig, ax = plt.subplots(1, 3, figsize = (12,4))
sns.countplot(x = 'fatalCount', data = gdf, ax = ax[0], color = 'k')
sns.countplot(x = 'seriousInjuryCount', data = gdf, ax = ax[1], color = 'r')
sns.countplot(x = 'minorInjuryCount', data = gdf, ax = ax[2], color = 'g')
plt.tight_layout()
plt.show()

In [7]:

fig, ax = plt.subplots(figsize = (10, 4))
sns.countplot(y="crashSeverity",data=gdf, order='NMSF',ax = ax) # Non-injury, Minor, Seriours, Fatal
plt.show()

In [8]:

fig, ax = plt.subplots(1, 3, figsize = (25, 4))
sns.lineplot(x = 'crashYear', y = 'fatalCount', data = gdf, ax = ax[0])
ax[0].set_title("The number of total death from year 2000 to 2018", fontsize = 14)

sns.lineplot(x = 'crashYear', y = 'seriousInjuryCount', data = gdf, ax = ax[1])
ax[1].set_title("The number of total serious injuries from year 2000 to 2018", fontsize = 14)

sns.lineplot(x = 'crashYear', y = 'minorInjuryCount', data = gdf, ax = ax[2])
ax[2].set_title("The number of total minor injuries from year 2000 to 2018", fontsize = 14)

[ax[i].set_xlabel('Crash Year') for i in range(3)]
plt.tight_layout()
plt.show()

Road Information¶

These variables, including roadworks, junctionType, roadCharacter, roadCurvature, roadLane, roadMarkings, roadSurface, roadWet, numberOfLanes, intersection, intersectionMidblock, flatHill, darkLight, provide information about the road condition when an accident occurs.

Let's plot some of them out to see their patterns.

In [9]:

fig, ax = plt.subplots(3, 3, figsize = (20, 10))

sns.countplot(x='roadCurvature',data=gdf, ax=ax[0,0])
sns.countplot(x="junctionType",data=gdf,  ax=ax[0,1])
sns.countplot(x='roadLane',data=gdf, ax=ax[0,2])
sns.countplot(x='numberOfLanes',data=gdf, ax=ax[1,0])
sns.countplot(x='roadWet',data=gdf, ax=ax[1,1])
sns.countplot(x='roadMarkings',data=gdf,  ax=ax[1,2])
sns.countplot(x='roadSurface',data=gdf, ax=ax[2,0])
sns.countplot(x='darkLight',data=gdf, ax=ax[2,1])
sns.countplot(x='intersection',data=gdf, ax=ax[2,2])
#intersectionMidblock
plt.tight_layout()

NumberOfLanes influence on fatal and serious injuries.

In [10]:

plt.figure(figsize = (12,5))
sns.barplot(x = 'numberOfLanes', y = 'fatalCount', data = gdf, palette = 'Reds_d')
plt.figure(figsize = (12, 5))
sns.barplot(x = 'numberOfLanes', y = 'seriousInjuryCount', data = gdf, palette = 'Blues_d')
plt.show()

JunctionType influence on fatal and serious injuries.

In [11]:

fig, ax = plt.subplots(1, 2, figsize = (12, 4))
sns.barplot(x= 'junctionType', y= 'fatalCount', data = gdf, palette = 'Reds_d', ax = ax[0])
sns.barplot(x= 'junctionType', y= 'seriousInjuryCount', data = gdf, palette = 'Blues_d', ax = ax[1])
plt.tight_layout()
plt.show()

RoadWet influence on fatal and serious injuries.

In [12]:

fig, ax = plt.subplots(1, 2, figsize = (8, 4))
sns.barplot(x= 'roadWet', y= 'fatalCount', data = gdf, palette = 'Reds_d', ax = ax[0])
sns.barplot(x= 'roadWet', y= 'seriousInjuryCount', data = gdf, palette = 'Blues_d', ax = ax[1])
plt.tight_layout()
plt.show()

DarkLight influence. Dark Light condition also has a significant influence on fatal and serious injury accident, observed from the figure shown below.

In [13]:

fig, ax = plt.subplots(1, 2, figsize = (8, 4))
sns.barplot(x= 'darkLight', y= 'fatalCount', data = gdf, palette = 'Reds_d', ax = ax[0])
sns.barplot(x= 'darkLight', y= 'seriousInjuryCount', data = gdf, palette = 'Blues_d', ax = ax[1])
plt.tight_layout()
plt.show()

RoadSurface influence. From the figure shown below, it can be easily observed that unsealed road surface has a much higher fatal or serious injury rate than sealed roads.

In [14]:

fig, ax = plt.subplots(1, 2, figsize = (8, 4))
sns.barplot(x= 'roadSurface', y= 'fatalCount', data = gdf, palette = 'Reds_d', ax = ax[0])
sns.barplot(x= 'roadSurface', y= 'seriousInjuryCount', data = gdf, palette = 'Blues_d', ax = ax[1])
plt.tight_layout()
plt.show()

Traffic Conditions¶

SpeedLimit with fatal and serious injuries.

In [15]:

fig, ax = plt.subplots(1, 2, figsize = (20, 4))
sns.barplot(x= 'speedLimit', y= 'fatalCount', data = gdf, palette = 'Reds_d', ax = ax[0])
sns.barplot(x= 'speedLimit', y= 'seriousInjuryCount', data = gdf, palette = 'Blues_d', ax = ax[1])
plt.tight_layout()
plt.show()

StreetLight with fatal and serious injuries.

In [16]:

fig, ax = plt.subplots(1, 2, figsize = (20, 4))
sns.barplot(x= 'streetLight', y= 'fatalCount', data = gdf, palette = 'Reds_d', ax = ax[0])
sns.barplot(x= 'streetLight', y= 'seriousInjuryCount', data = gdf, palette = 'Blues_d', ax = ax[1])
plt.tight_layout()
plt.show()

AdvisorySpeed with fatal and serious injuries.

In [17]:

sns.catplot(x="advisorySpeed", y="crashSeverity",data=gdf, kind= 'bar');
plt.show()

Weather Conditions¶

Total Weather Counts, weatherA and weatherB. And their influence on fatal and serious injuries.

In [18]:

fig, ax = plt.subplots(1, 2, figsize=(15, 5))
sns.countplot(x="weatherA",data=gdf, ax = ax[0], palette="Greens_d");
sns.countplot(x="weatherB",data=gdf, ax = ax[1], palette="Greens_d");
plt.tight_layout()
plt.show()

In [19]:

fig, ax = plt.subplots(1, 2, figsize=(15, 5))
sns.barplot(x="weatherA", y = 'fatalCount', data=gdf, ax = ax[0], palette="Reds_d");
sns.barplot(x="weatherB",y = 'seriousInjuryCount', data=gdf, ax = ax[1], palette="Blues_d");
plt.tight_layout()
plt.show()

Geographic Data Exploration¶

In [20]:

gdf.plot(markersize = 0.01, edgecolor = 'r', figsize = (12,12))
plt.axis('off')

Out[20]:

(166.67652523873306,
 179.10957030210642,
 -47.52866483091503,
 -33.806398085455236)

In [21]:

from folium.plugins import MarkerCluster
gdf_sample = gdf.sample(5000)
lons = gdf_sample.geometry.x
lats = gdf_sample.geometry.y

m = folium.Map(location = [np.mean(lats), np.mean(lons)],
              zoom_start =5.7)

#FastMarkerCluster(data=list(zip(lats, lons))).add_to(m)
MarkerCluster(list(zip(lats, lons))).add_to(m)

folium.LayerControl().add_to(m)
m

Out[21]:

In [22]:

gdf_sample = gdf.sample(5000)
lons = gdf_sample.geometry.x
lats = gdf_sample.geometry.y
heat_cols = list(zip(lats, lons))
from folium.plugins import HeatMap

m = folium.Map([np.mean(lats), np.mean(lons)], 
               tiles='CartoDB dark_matter', 
               zoom_start=7)

HeatMap(heat_cols).add_to(m)
m

Out[22]:

Data Modelling¶

We can approach the machine learning part in different ways. We can consider it as a regression problem and predict the number of death or serious injuries. We can also take it as a classification problem and classify the severity of a certain car crash. In this post, we will first approach it as a regression problem. We will try to predict the the fatalCount based on all other variables. Random Forest will be used the the modelling algorithm. It needs only minimal data preprocessing and does well in many cases.

In [23]:

# import libraries
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# first tranfer our data from geoDataFrame to pandas DataFrame, so that we can use sklearn to work on it.
df = pd.DataFrame(gdf.drop(['geometry', 'OBJECTID'], axis = 1))
df.head()

Out[23]:

	Pedestrian	areaUnitID	carStationWagon	cliffBank	cornerRoadSideRoad	...	urban	vanOrUtility	weatherA	weatherB
0	1	514101	1	0	1	...	Urban	0	Fine	Unknown
1	1	514102	0	0	1	...	Urban	0	Fine	Unknown
2	1	521802	1	0	1	...	Urban	0	Fine	Unknown
3	0	508801	0	1	1	...	Urban	1	Fine	Unknown
4	1	508411	1	0	1	...	Urban	0	Fine	Unknown

5 rows × 86 columns

In [24]:

gdf.head()

Out[24]:

	OBJECTID	Pedestrian	areaUnitID	carStationWagon	cliffBank	...	urban	vanOrUtility	weatherA	weatherB
0	3001	1	514101	1	0	...	Urban	0	Fine	Unknown
1	3002	1	514102	0	0	...	Urban	0	Fine	Unknown
2	3003	1	521802	1	0	...	Urban	0	Fine	Unknown
3	3004	0	508801	0	1	...	Urban	1	Fine	Unknown
4	3005	1	508411	1	0	...	Urban	0	Fine	Unknown

5 rows × 88 columns

Regression Modeling with Random forest¶

We first need to convert categorical variables into numerical variables, using LabelEncoder and OneHotEncoder.

In [25]:

from sklearn.preprocessing import LabelEncoder
LE = LabelEncoder()
for i in df:
    if df[i].dtype =='object':
        LE.fit(df[i])
        df[i] = LE.transform(df[i])
        
df = pd.get_dummies(df)
df.head()

Out[25]:

	Pedestrian	areaUnitID	carStationWagon	cliffBank	cornerRoadSideRoad	...	urban	vanOrUtility	weatherB
0	1	514101	1	0	1	...	2	0	2
1	1	514102	0	0	1	...	2	0	2
2	1	521802	1	0	1	...	2	0	2
3	0	508801	0	1	1	...	2	1	2
4	1	508411	1	0	1	...	2	0	2

5 rows × 86 columns

In [26]:

# split data into training and validation sets
X_train, X_test, y_train, y_test = train_test_split(df.drop('fatalCount', axis = 1), df.fatalCount, 
                                                   test_size = 0.2, random_state = 42)
print((X_train.shape, X_test.shape, y_train.shape, y_test.shape))

((532677, 85), (133170, 85), (532677,), (133170,))

In [36]:

def rmse(x,y):
    return np.sqrt(((x-y)**2).mean())

def print_score(m):
    res =[rmse(m.predict(X_train), y_train),
         rmse(m.predict(X_test), y_test),
         m.score(X_train,y_train),
        m.score(X_test, y_test)]
    if hasattr(m, 'oob_score_'):
        res.append(m.oob_score_)
    print(res)

In [28]:

# Initialize the regressor
m = RandomForestRegressor(n_estimators = 50)
m.fit(X_train, Y_train)
print_score(m)

[0.017907117808386587, 0.04273625304671058, 0.9766110961503995, 0.8620394418555027]

Feature Importance

In [29]:

f_imp = pd.DataFrame(data={'importance':m.feature_importances_,'features':X_train.columns}).set_index('features')
f_imp = f_imp.sort_values('importance', ascending = False)
f_imp.head(10)

Out[29]:

	importance
features
crashSeverity	0.855630
crashLocation2	0.010599
easting	0.009086
crashDistance	0.007509
crashRPDisplacement	0.007109
crashRPRS	0.006428
crashLocation1	0.006199
seriousInjuryCount	0.005648
northing	0.005389
meshblockID	0.004877

Clearly, crashSeverity is the most important in our model, because it is derived from the number of deaths.

In [30]:

f_imp[:20].plot(kind='bar', figsize=(12,10))
plt.show()

Select the variables with importance.

In [31]:

f_imp_sel = f_imp[f_imp['importance'] > 0.0001]

In [32]:

len(f_imp_sel.index), len(f_imp.index)

Out[32]:

(73, 85)

We have 73 out of 85 variables with importance value over 0.0001.

In [33]:

y = df.fatalCount
df_imp = df[f_imp_sel.index]
df_imp.shape

Out[33]:

(665847, 73)

Retrain the RandomForestRegressor with only the variables with significance.

In [34]:

# Let us split our data into training and validation sets
X_train, X_test, y_train, y_test = train_test_split(df_imp, y, test_size=0.33, random_state=42)

m = RandomForestRegressor(n_estimators=50, oob_score=True)
m.fit(X_train, y_train)

Out[34]:

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=50, n_jobs=None,
           oob_score=True, random_state=None, verbose=0, warm_start=False)

In [37]:

print_score(m)

[0.017239725389319133, 0.04320178481356111, 0.978469926234861, 0.8589594053214539, 0.8404163772494878]

Adjusting the parameters of the RF regressor to see if we can get better results. Of course, you can also use a Grid Search Cross-validation to find more suitable parameters. Here, to make it easy, I will not go through this procedure. Because it may take a long time.

In [38]:

m = RandomForestRegressor(n_estimators=100, min_samples_leaf=10, oob_score=True)
m.fit(X_train, y_train)
print_score(m)

[0.03777296689755251, 0.04108996727501145, 0.8966413028101987, 0.8724112618047403, 0.8542367058273785]