---
title: "Notes on Perceptron"
author: "Faiyaz Hasan"
date: "July 19, 2016"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
Introduction
------------
The perceptron is a simple learning algorithm designed by Frank Rosenblatt. We'll get into the formalism in a bit. First consider an extreme example: a data set containing a list of weights and the corresponding binary labels saying either elephant or mouse, depending on the weight. The perceptron takes in the training dataset and eventually learns to identify the correct class based on the weight of the object. Of course, certain conditions need to be satisfied for the convergence properties of the model to hold.
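As a rough illustration of the kind of input the algorithm expects, here is a made-up version of that toy dataset (the weights and labels below are invented purely for illustration):
```{r, toy weight example}
# hypothetical animal weights in kilograms with binary labels:
# +1 for elephant, -1 for mouse (values invented for illustration)
animal_weights <- data.frame(
  weight = c(4800, 5200, 0.02, 6100, 0.03, 0.025),
  label  = c(1, 1, -1, 1, -1, -1)
)
animal_weights
```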
Perceptron - Binary classification algorithm
--------------------------------------------
Here, we use the Iris data set.
```{r, iris exploratory data analysis + clean up data frame}
# load iris data set
data(iris)
# subset of iris data frame - extract only species versicolor and setosa
# we will only focus on the sepal and petal lengths of the dataset
irissubdf <- iris[1:100, c(1, 3, 5)]
names(irissubdf) <- c("sepal", "petal", "species")
head(irissubdf)
# plot data - a picture is worth a thousand words
library(ggplot2)
ggplot(irissubdf, aes(x = sepal, y = petal)) +
  geom_point(aes(colour = species, shape = species), size = 3) +
  xlab("sepal length") +
  ylab("petal length") +
  ggtitle("Species vs sepal and petal lengths")
# add binary labels corresponding to species - initialize all values to 1,
# then set the setosa labels to -1. The binary +1/-1 labels go in the
# fourth column. It is cleaner to then split this into two objects: one
# holding the attributes and the other holding the class labels.
irissubdf[, 4] <- 1
irissubdf[irissubdf[, 3] == "setosa", 4] <- -1
x <- irissubdf[, c(1, 2)]
y <- irissubdf[, 4]
# head and tail of data
head(x)
head(y)
```
In this section, let us simply apply the perceptron learning algorithm to this data. In the next section, we will look at the convergence of the weight vector and experiment with the learning rate.
```{r, perceptron algorithm}
# Write a function that takes in the data frame, the learning rate (eta), and
# the number of epochs (niter) and updates the weight vector. At this stage, I
# am only concerned with the final weight and the number of epochs required
# for the weight to converge.
perceptron <- function(x, y, eta, niter) {
  # initialize weight vector
  weight <- rep(0, dim(x)[2] + 1)
  errors <- rep(0, niter)
  # loop over number of epochs niter
  for (jj in 1:niter) {
    # loop through training data set
    for (ii in 1:length(y)) {
      # Predict binary label using the Heaviside activation function
      z <- sum(weight[2:length(weight)] * as.numeric(x[ii, ])) + weight[1]
      if (z < 0) {
        ypred <- -1
      } else {
        ypred <- 1
      }
      # Update weight - the formula doesn't change anything
      # if the predicted value is correct
      weightdiff <- eta * (y[ii] - ypred) * c(1, as.numeric(x[ii, ]))
      weight <- weight + weightdiff
      # Update error count
      if ((y[ii] - ypred) != 0.0) {
        errors[jj] <- errors[jj] + 1
      }
    }
  }
  # weight to decide between the two species
  print(weight)
  return(errors)
}
err <- perceptron(x, y, 1, 10)
plot(1:10, err, type="l", lwd=2, col="red", xlab="epoch #", ylab="errors")
title("Errors vs epoch - learning rate eta = 1")
```
Note that increasing the learning rate does exactly what its name suggests: the weight updates become larger, so fewer epochs are needed to settle. The caveat is that too large a rate can make the updates overshoot and bounce around before settling. As long as convergence is ensured (the data is linearly separable), a larger learning rate (between 0 and 1) leads to faster convergence.
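One way to check the effect of the learning rate empirically is simply to rerun the perceptron with different rates on the same data and compare the error curves. A minimal sketch (the two eta values below are arbitrary choices for illustration):
```{r, learning rate comparison}
# rerun the perceptron on the same data with a small and a large learning rate
err_small <- perceptron(x, y, 0.01, 10)
err_large <- perceptron(x, y, 1, 10)
# overlay the per-epoch error counts for the two rates
plot(1:10, err_small, type = "l", lwd = 2, col = "blue",
     xlab = "epoch #", ylab = "errors")
lines(1:10, err_large, lwd = 2, col = "red")
legend("topright", legend = c("eta = 0.01", "eta = 1"),
       col = c("blue", "red"), lwd = 2)
title("Errors vs epoch for two learning rates")
```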
How to implement a multiclass classification in the perceptron?
---------------------------------------------------------------
The iris data set we began with has three different species. We removed virginica for the sake of binary classification. Now, by chaining binary classifiers, I will extend the perceptron algorithm to deal with all three species.
The game plan is as follows:
* First create a data frame with sepal and petal lengths.
* Label the corresponding species data as 1 for virginica and -1 for setosa **OR** versicolor.
* The weight obtained after convergence will decide between virginica and not virginica.
* Then, the weight from the previous section can be used to determine between setosa and versicolor.
* A possible issue is that the sepal and petal lengths might not be enough to separate the species. So, begin by plotting them.
```{r, species vs sepal and petal length}
# iris data subset
irisdata <- iris[, c(1, 3, 5)]
names(irisdata) <- c("sepal", "petal", "species")
# ggplot the data
ggplot(irisdata, aes(x = sepal, y = petal)) +
  geom_point(aes(colour = species, shape = species), size = 3) +
  xlab("sepal length") +
  ylab("petal length") +
  ggtitle("Species vs sepal and petal lengths")
```
From the plot, we can see that versicolor and virginica overlap, so the data is not linearly separable based on these two attributes alone. More attributes are needed. So, I'll keep all four attributes of the iris data set (sepal and petal lengths and widths) to see whether the weight can still converge.
```{r, virginica classification}
# subset of properties of flowers of iris data set
x <- iris[, 1:4]
names(x) <- tolower(names(x))
# create species labels
y <- rep(-1, dim(x)[1])
y[iris[, 5] == "virginica"] <- 1
# compute and plot error
err <- perceptron(x, y, 0.01, 50)
plot(1:50, err, type="l", lwd=2, col="red", xlab="epoch #", ylab="errors")
title("Errors in differentiating Virginica vs epoch - learning rate eta = 0.01")
```
This plot is even more informative than the differentiation between setosa and versicolor. First of all, note that although the weight stabilizes, the error count does not converge to 0. The minimum number of errors we get is $2$. Interestingly enough, we can now assign a semi-meaningful confidence level to our predictions, since we got `r 2/dim(iris)[1]` of the predictions wrong with our weights.
Another thought that arises is that I should separate out setosa from non-setosa first, since that split is linearly separable. The only uncertainty is then in the next step, where we try to distinguish between the non linearly-separable classes (versicolor and virginica). If we do it in the reverse order, the uncertainty from the first split propagates and becomes larger.
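A minimal sketch of how the two stages could be chained at prediction time, under the assumption that we have the two converged weight vectors in hand (the names `w_setosa`, `w_virginica`, and `predict_species` below are hypothetical; the `perceptron()` function above would also need to return its weights rather than only print them):
```{r, two-stage prediction sketch, eval=FALSE}
# classify one flower (a numeric vector of the four attributes) by chaining
# two binary perceptrons: setosa vs non-setosa first, then virginica vs
# versicolor for the remaining flowers. Assumes both weight vectors were
# trained on all four attributes with the label conventions used above
# (setosa = -1, virginica = +1).
predict_species <- function(flower, w_setosa, w_virginica) {
  # stage 1: the linearly separable split
  if (w_setosa[1] + sum(w_setosa[-1] * flower) < 0) {
    return("setosa")
  }
  # stage 2: the harder, non-separable split
  if (w_virginica[1] + sum(w_virginica[-1] * flower) >= 0) {
    "virginica"
  } else {
    "versicolor"
  }
}
```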
List of questions and curiosities
---------------------------------
1. **How to figure out the optimum learning rate from the data?**
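One crude way to start exploring this empirically is a grid search over candidate rates. A minimal sketch, assuming `x`, `y`, and `perceptron()` from the previous section are still in scope and using an arbitrary grid of eta values:
```{r, learning rate grid search}
# rerun the perceptron for each candidate eta and record the smallest
# per-epoch error count it reaches within 50 epochs
etas <- c(0.001, 0.01, 0.1, 0.5, 1)
min_errors <- sapply(etas, function(eta) min(perceptron(x, y, eta, 50)))
data.frame(eta = etas, min_errors = min_errors)
```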