Machine Learning and statistics project 2019
The Boston dataset was first published in 1978 in the paper Hedonic Housing Prices and the Demand for Clean Air by Harrison and Rubinfeld. Its 506 entries aggregate 14 features for homes in the Boston area, capturing the crime rate (CRIM), the average number of rooms (RM), the age of the housing stock (AGE) and more. The dataset is widely used in machine learning papers that address regression problems.
- Use descriptive statistics and plots to describe the Boston House Prices dataset.
- Use inferential statistics to analyse whether there is a significant difference in median house prices between houses that are along the Charles river and those that aren’t. You should explain and discuss your findings within the notebook.
- Use keras to create a neural network that can predict the median house price based on the other variables in the dataset.
I recommend using nbviewer to view this file: https://nbviewer.jupyter.org/github/RitRa/MachineLearning-project/blob/master/Machine%20Learning%20and%20Statistics%20Project%202019.ipynb
This project concerns the Boston House Prices dataset and the Python packages SciPy, Keras, and Jupyter.
I recommend installing Jupyter via the Anaconda distribution to run this project.
Dataset used
from sklearn.datasets import load_boston
import pandas as pd

# load_boston() returns a Bunch object; build a DataFrame from it and
# append the target (median value in $1000s) as the MEDV column
data = load_boston()
boston = pd.DataFrame(data.data, columns=data.feature_names)
boston['MEDV'] = data.target
Libraries used in this Jupyter notebook include:

- Pandas: an open-source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for Python.
- NumPy: the fundamental package for scientific computing with Python.
- Matplotlib: a Python 2D plotting library which produces publication-quality figures in a variety of hardcopy formats and interactive environments across platforms.
- Seaborn: a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.
- researchpy: produces pandas DataFrames that contain the statistical testing information commonly required for academic research.

  conda install -c researchpy researchpy

- statsmodels: a Python module that provides classes and functions for estimating many different statistical models, as well as for conducting statistical tests and statistical data exploration.

  conda install -c anaconda statsmodels

- ipywidgets: interactive widgets for Jupyter notebooks, used here for the interactive charts.

  conda install -c conda-forge ipywidgets

- Keras: a high-level neural networks API, written in Python and capable of running on top of TensorFlow, CNTK, or Theano.

  conda install -c conda-forge keras
import numpy as np
import pandas as pd

# charts
import seaborn as sns
import matplotlib.pyplot as plt

# for creating a folder for plots
import os

# statistical analysis
import researchpy as rp
import statsmodels.api as sm

# interactive widgets for charts
from ipywidgets import interact, interactive, fixed
import ipywidgets as widgets

# machine learning
from keras.models import Sequential
from keras.layers import Dense
boston.head() shows the first five rows:

| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1.0 | 296.0 | 15.3 | 396.90 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.90 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.90 | 5.33 | 36.2 |
describe() gives us a quick overview of the dataset
summary = boston.describe()
summary = summary.transpose()
summary.head(14)
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| CRIM | 506.0 | 3.613524 | 8.601545 | 0.00632 | 0.082045 | 0.25651 | 3.677083 | 88.976200 |
| ZN | 506.0 | 11.363636 | 23.322453 | 0.00000 | 0.000000 | 0.00000 | 12.500000 | 100.000000 |
| INDUS | 506.0 | 11.136779 | 6.860353 | 0.46000 | 5.190000 | 9.69000 | 18.100000 | 27.740000 |
| CHAS | 506.0 | 0.069170 | 0.253994 | 0.00000 | 0.000000 | 0.00000 | 0.000000 | 1.000000 |
| NOX | 506.0 | 0.554695 | 0.115878 | 0.38500 | 0.449000 | 0.53800 | 0.624000 | 0.871000 |
| RM | 506.0 | 6.284634 | 0.702617 | 3.56100 | 5.885500 | 6.20850 | 6.623500 | 8.780000 |
| AGE | 506.0 | 68.574901 | 28.148861 | 2.90000 | 45.025000 | 77.50000 | 94.075000 | 100.000000 |
| DIS | 506.0 | 3.795043 | 2.105710 | 1.12960 | 2.100175 | 3.20745 | 5.188425 | 12.126500 |
| RAD | 506.0 | 9.549407 | 8.707259 | 1.00000 | 4.000000 | 5.00000 | 24.000000 | 24.000000 |
| TAX | 506.0 | 408.237154 | 168.537116 | 187.00000 | 279.000000 | 330.00000 | 666.000000 | 711.000000 |
| PTRATIO | 506.0 | 18.455534 | 2.164946 | 12.60000 | 17.400000 | 19.05000 | 20.200000 | 22.000000 |
| B | 506.0 | 356.674032 | 91.294864 | 0.32000 | 375.377500 | 391.44000 | 396.225000 | 396.900000 |
| LSTAT | 506.0 | 12.653063 | 7.141062 | 1.73000 | 6.950000 | 11.36000 | 16.955000 | 37.970000 |
| MEDV | 506.0 | 22.532806 | 9.197104 | 5.00000 | 17.025000 | 21.20000 | 25.000000 | 50.000000 |
Let's look at the correlation between the variables in the dataset:

- Positive correlation: both variables change in the same direction (light colour).
- Neutral correlation: no relationship in the change of the variables.
- Negative correlation: the variables change in opposite directions (dark colour).
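A minimal sketch of how such a heatmap can be drawn with seaborn (assuming boston is the DataFrame loaded above):

# correlation matrix of all 14 variables, drawn as an annotated heatmap
corr = boston.corr()

plt.figure(figsize=(12, 10))
sns.heatmap(corr, annot=True, fmt='.2f', cmap='viridis')
plt.title('Correlation between the Boston dataset variables')
plt.show()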
From the correlation heatmap:

- We can see a positive correlation between MEDV and RM at 0.69: as the average number of rooms increases, so does the price of the house.
- Negative correlation between MEDV and LSTAT (% lower status of the population): -0.76.
- Negative correlation between MEDV and PTRATIO (pupil-teacher ratio by town): -0.52.
- Negative correlation between MEDV and INDUS (proportion of non-retail business acres per town): -0.6.
Let's plot these for more detail:
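One way to draw those plots is a row of scatter plots, one per variable; a sketch, assuming the four variables picked out above:

# scatter plots of MEDV against each strongly correlated variable
fig, axes = plt.subplots(1, 4, figsize=(20, 4), sharey=True)
for ax, col in zip(axes, ['RM', 'LSTAT', 'PTRATIO', 'INDUS']):
    ax.scatter(boston[col], boston['MEDV'], alpha=0.5)
    ax.set_xlabel(col)
axes[0].set_ylabel('MEDV')
plt.show()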
Use inferential statistics to analyse whether there is a significant difference in median house prices between houses that are along the Charles river and those that aren’t.
To visualise the spread of the data, we can compare box plots of house prices for riverside and non-riverside homes, as sketched below.
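The test that follows compares two subsets, riverhouse_df and other_df, split on the CHAS dummy variable. A minimal sketch of the split and the box plots (the notebook also drops duplicate rows before testing, which is not shown here):

# CHAS = 1 if the tract bounds the Charles River, 0 otherwise
riverhouse_df = boston[boston['CHAS'] == 1]
other_df = boston[boston['CHAS'] == 0]

# box plots of the spread of MEDV for each group
sns.boxplot(x='CHAS', y='MEDV', data=boston)
plt.xticks([0, 1], ['Not on river', 'On river'])
plt.show()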
Using a t-test we can find out whether the difference between the two samples is statistically significant. I will use the researchpy library for this section.
descriptives, results = rp.ttest(other_df['MEDV'], riverhouse_df['MEDV'])
descriptives
| Variable | N | Mean | SD | SE | 95% Conf. | Interval |
|---|---|---|---|---|---|---|
| MEDV | 461.0 | 21488.503254 | 7898.848164 | 367.886036 | 20765.557728 | 22211.448780 |
| MEDV | 29.0 | 23979.310345 | 7024.161328 | 1304.354013 | 21307.462269 | 26651.158421 |
| combined | 490.0 | 21635.918367 | 7865.301063 | 355.318083 | 20937.779775 | 22334.056960 |
This gives us a good overview of the two samples: the first row is the non-river group (N = 461) and the second the riverside group (N = 29). We can see that there is not a large difference in the means, and the standard deviations are also similar.
| | Independent t-test | results |
|---|---|---|
| 0 | Difference (MEDV - MEDV) = | -2490.8071 |
| 1 | Degrees of freedom = | 488.0000 |
| 2 | t = | -1.6571 |
| 3 | Two side test p value = | 0.0981 |
| 4 | Difference > 0 p value = | 0.0491 |
| 5 | Difference < 0 p value = | 0.9509 |
| 6 | Cohen's d = | -0.3172 |
| 7 | Hedge's g = | -0.3168 |
| 8 | Glass's delta = | -0.3153 |
| 9 | r = | 0.0748 |
Results: since the two-sided p-value (0.0981) is greater than 0.05, the difference in median house prices between the two groups is not statistically significant. Prior to removing the duplicate rows, the results showed a p-value of less than 0.05.
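As a cross-check, the same test can be run with SciPy; a sketch, assuming the other_df and riverhouse_df subsets above (researchpy's default is the equal-variance Student's t-test, so equal_var=True matches it):

from scipy import stats

# independent two-sample t-test with pooled variance
t_stat, p_value = stats.ttest_ind(other_df['MEDV'], riverhouse_df['MEDV'], equal_var=True)
print(f't = {t_stat:.4f}, p = {p_value:.4f}')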
We will use Keras to create a neural network that can predict the median house price based on the other variables in the dataset. Neural networks are a set of algorithms, modelled loosely after the human brain, that are designed to recognise patterns.
We will take only the variables that had a strong correlation with 'MEDV', which is the variable we seek to predict.
# alternatively, all thirteen features: boston.iloc[:, 0:13]
features = boston[['RM', 'LSTAT', 'PTRATIO', 'INDUS']]
# the target is the price
prices = boston['MEDV']
It is best practice to normalise the data before feeding it to the neural network, which generally leads to better results.
from sklearn.preprocessing import MinMaxScaler

# use separate scalers so the target scaling can be inverted later
feature_scaler = MinMaxScaler()
features = feature_scaler.fit_transform(features)

price_scaler = MinMaxScaler()
prices = price_scaler.fit_transform(prices.values.reshape(-1, 1))
Now we need to split the data into two subsets: the training set is used to fit the neural network, and the test set is held back to evaluate it.
from sklearn.model_selection import train_test_split
# Shuffle and split the data into training and testing subsets
X_train, X_test, y_train, y_test = train_test_split(features, prices, test_size=0.20, random_state=50)
With our training and test data set up, we are now ready to build our model.
We create the model using one input layer, two hidden layers and one output layer.
As this is a regression problem, the loss function is mean squared error, and we evaluate the model's performance with mean absolute error (classification-style accuracy is not meaningful for a continuous target). The output layer uses a linear activation so the predictions are unbounded.
model = Sequential()
n_cols = X_train.shape[1]

# the input layer
model.add(Dense(128, kernel_initializer='normal', input_shape=(n_cols,), activation='relu'))

# the hidden layers
model.add(Dense(256, kernel_initializer='normal', activation='relu'))
model.add(Dense(128, kernel_initializer='normal', activation='relu'))

# the output layer: linear activation for a regression target
model.add(Dense(1, kernel_initializer='normal', activation='linear'))

print(model.summary())

# mean squared error loss with mean absolute error as the metric
model.compile(loss='mse', optimizer='adam', metrics=['mean_absolute_error'])
# model.fit returns a History object recording the loss per epoch,
# so no explicit callback is needed; hold back 20% of the training
# data for validation
hist = model.fit(X_train, y_train, validation_split=0.20, epochs=100, batch_size=6)
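To see how training went, the recorded history can be plotted and the model scored on the held-out test set. A minimal sketch, assuming the price_scaler fitted above is used to invert the target scaling:

# training vs validation loss over the epochs
plt.plot(hist.history['loss'], label='train')
plt.plot(hist.history['val_loss'], label='validation')
plt.xlabel('epoch')
plt.ylabel('MSE loss')
plt.legend()
plt.show()

# score the model on the unseen test data
test_mse, test_mae = model.evaluate(X_test, y_test)
print(f'Test MSE: {test_mse:.4f}, Test MAE: {test_mae:.4f}')

# predictions come back in the scaled [0, 1] range; invert the
# scaling to get prices back in the original units
predictions = price_scaler.inverse_transform(model.predict(X_test))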