Project2018-iris

The Iris flower data set or Fisher's Iris data set is a multivariate data set introduced by the British statistician and biologist Ronald Fisher in his 1936 paper "The use of multiple measurements in taxonomic problems" as an example of linear discriminant analysis. This famous iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.

The dataset contains a set of 150 records under 5 attributes -

sepal length in cm
sepal width in cm
petal length in cm
petal width in cm
Species: -- Iris Setosa -- Iris Versicolour -- Iris Virginica

Libraries Used

Importing the libaries for this project: Pandas, Numpy, Holoviews.

Pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools.

NumPy is the fundamental package for scientific computing with Python

HoloViews is an open-source Python library designed to make data analysis and visualization seamless and simple.

Seaborn is a Python visualization library based on matplotlib. It provides a high-level interface for drawing attractive statistical graphics.

I also used the Jupyter Notebook for this project.

import pandas as pd
import numpy as np
import seaborn as sns

Data Import

Import the iris.csv using the panda library and examine first few rows of data

iris_data = pd.read_csv('assets/iris.csv')
iris_data.columns = ['sepal_length', 'sepal_width' , 'petal_length', 'petal_width', 'species']

#you can specific the number to show here
iris_data.head(10)

Discovering the Shape of the table

Find out what the size of rows and columns in the table

iris_data.shape

Find out unique classification/type of iris flower and the amount

iris_data['species'].unique()
print(iris_data.groupby('species').size())

'Iris-setosa', 'Iris-versicolor', 'Iris-virginica'

Investigating the data

Min, Max, Mean, Median and Standard Deviation

iris_data.min()
iris_data.max()
iris_data.mean()
iris_data.median()
iris_data.std()

Summary Statistics Table

This statistics table is a much nicer, cleaner way to present the data. We can see there is huge range in the size of the Sepal Length and Petal Length. We will use box plots and scatter plots to see if the size is related to the species of Iris.

summary = iris_data.describe()
summary = summary.transpose()
summary.head()

Boxplots

The boxplot is a quick way of visually summarizing one or more groups of numerical data through their quartiles. Comparing the distributions of:

Sepal Length
Sepal Width
Petal Length
Petal Width

From the Boxplot, we can see that there are distinct differences between the Petal Length, Petal Width and Sepal Length across the Species.

Scatterplots

Here we can use to variables to show that there is distinct difference in sizes between the species. Firstly, we look at the Petal width and Petal length across the species. Is it clear to see that the iris Setosa has a significantly smaller petal width and petal length than the other two species. This difference occurs again for the Petal width and Sepal length. And in both cases we can see that the Iris Viginica is the largest species.

Pairplot

This chart enables us to quickly see the relationships between variables across multiple dimensions usings scatterplots and histograms.

Plotting regression and confidence intervals

Use kernel density estimates for univariate plots

Swarm plot

Violin plot

A voilin plot is used to visualise the distribution of the data and its probability density. The thick black bar in the center represents the interquartile range, the thin black line extended from it represents the 95% confidence intervals, and the white dot is the median.

Petal Length

Petal Width

Box plot

Machine Learning using scikit-learn

Using the Scikit-learn library we can perform machine learning on the dataset. As this is my first step into machine learning I have heavily relied on the tutorials below for help.

What is Scikit-learn?

It is a free machine learning library for python. It features various classification, regression and clustering algorithms. Built on Numpy and Scipy. For this project, I will use the powerful classification algorithm, K-Nearest-Neighbors (KNN) to perform supervised learning.

As the dataset is already import into scikit-learn, I will reuse it. Here are the steps:

Import Data
Investigate the Data
Perform supervised Learning with K-Nearest-Neighbors (KNN)
Fitting the model
Predict the response

This data is four-dimensional, but we can visualize two of the dimensions at a time using a scatter plot:

Using supervised learning with K-Nearest Neighbours(KNN), we are able to ask the algorithm "Based on these measurements, what is the species?"

Question: What kind of iris has 3cm x 5cm sepal and 4cm x 2cm petal?

knn.predict([[3, 5, 4, 2]])

answer:['virginica']

A plot of the sepal space and the prediction of the KNN

References

Background info https://en.wikipedia.org/wiki/Iris_flower_data_set https://archive.ics.uci.edu/ml/datasets/iris

Summary values https://stackoverflow.com/questions/33889310/r-summary-equivalent-in-numpy

R iris project https://rstudio-pubs-static.s3.amazonaws.com/205883_b658730c12d14aa6996fe2f6c612c65f.html

python iris project https://rajritvikblog.wordpress.com/2017/06/29/iris-dataset-analysis-python/

min value http://www.datasciencemadesimple.com/get-minimum-value-column-python-pandas/

A histogram with Iris Dataset: Sora Jin June 21st, 2015 https://rpubs.com/Sora/developing-data-product

Plot 2D views of the iris dataset http://www.scipy-lectures.org/packages/scikit-learn/auto_examples/plot_iris_scatter.html

Statistics in Python http://www.scipy-lectures.org/packages/statistics/index.html#statistics

Python - IRIS Data visualization and explanation https://www.kaggle.com/abhishekkrg/python-iris-data-visualization-and-explanation

Visualization with Seaborn (Python) https://www.kaggle.com/rahulm7/visualization-with-seaborn-python

Iris Data Visualization using Python https://www.kaggle.com/aschakra/iris-data-visualization-using-python

Seaborn Understanding the Weird Parts: pairplot https://www.youtube.com/watch?v=cpZExlOKFH4

Docs https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html

http://holoviews.org/gallery/demos/bokeh/boxplot_chart.html

Machine Learning Tutorial http://scikit-learn.org/stable/tutorial/basic/tutorial.html

http://www.scipy-lectures.org/packages/scikit-learn/index.html

https://kevinzakka.github.io/2016/07/13/k-nearest-neighbor/

https://machinelearningmastery.com/machine-learning-in-python-step-by-step/

https://github.com/whatsrupp/iris-classification/blob/master/petal_classifier.py

https://diwashrestha.com/2017/09/18/machine-learning-on-iris/

https://www.youtube.com/watch?v=rNHKCKXZde8

http://seaborn.pydata.org/examples/scatterplot_categorical.html

IRIS DATASET ANALYSIS (PYTHON) http://d4t4.biz/ml-with-scikit-learn/support-vector-machines-project-wip/

Getting started in scikit-learn with the famous iris dataset https://www.youtube.com/watch?v=hd1W4CyPX58 http://blog.kaggle.com/2015/04/22/scikit-learn-video-3-machine-learning-first-steps-with-the-iris-dataset/

Name		Name	Last commit message	Last commit date
Latest commit History 53 Commits
assets		assets
.gitignore		.gitignore
LICENSE		LICENSE
Project+2018+-+Fishers+Iris+data+set+analysis.ipynb		Project+2018+-+Fishers+Iris+data+set+analysis.ipynb
Project+2018+-+Fishers+Iris+data+set+analysis.py		Project+2018+-+Fishers+Iris+data+set+analysis.py
README.md		README.md
_config.yml		_config.yml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Project2018-iris

Libraries Used

Data Import

Discovering the Shape of the table

Find out unique classification/type of iris flower and the amount

Investigating the data

Summary Statistics Table

Boxplots

Scatterplots

Pairplot

Swarm plot

Violin plot

Box plot

Machine Learning using scikit-learn

What is Scikit-learn?

Using supervised learning with K-Nearest Neighbours(KNN), we are able to ask the algorithm "Based on these measurements, what is the species?"

References

About

Releases

Packages

Languages

License

RitRa/Project2018-iris

Folders and files

Latest commit

History

Repository files navigation

Project2018-iris

Libraries Used

Data Import

Discovering the Shape of the table

Find out unique classification/type of iris flower and the amount

Investigating the data

Summary Statistics Table

Boxplots

Scatterplots

Pairplot

Swarm plot

Violin plot

Box plot

Machine Learning using scikit-learn

What is Scikit-learn?

Using supervised learning with K-Nearest Neighbours(KNN), we are able to ask the algorithm "Based on these measurements, what is the species?"

References

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages