# Exercises for the course Machine Learning by Stanford
The exercises correspond to the course available through Coursera from September through November 2016.
These are my solutions to the programming assignments.
- Week 4 - Neural Networks: Representation
- Week 3 - Logistic Regression
- Week 2 - Linear Regression with Multiple Variables
## Week 4 - Neural Networks: Representation

This week, we implemented a one-vs-all logistic regression classifier to recognize handwritten digits. We also used a neural network to predict the digits, given a layer of pre-learned weights, by applying the feedforward propagation algorithm.
In the one-vs-all method, we train a classifier for each of the possible digits from 1-10. For this, we first had to write the vectorized implementation of logistic regression, which includes the vectorized cost function and gradient.
Once we had these implementations, we used the fmincg function provided with the exercise, which performs better than fminunc when there is a large number of parameters.
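Roughly, the training loop looked like this (a sketch, assuming the exercise's `lrCostFunction` returns the regularized cost and gradient, and that `num_labels`, `n` and `lambda` are already defined):

```octave
options = optimset('GradObj', 'on', 'MaxIter', 50);
all_theta = zeros(num_labels, n + 1);
for c = 1:num_labels
  initial_theta = zeros(n + 1, 1);
  % y == c turns the multi-class labels into a binary vector for class c
  all_theta(c, :) = fmincg(@(t) lrCostFunction(t, X, (y == c), lambda), ...
                           initial_theta, options)';
end
```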
An interesting detail is that each classifier outputs the probability of the example being its digit, so we take the max over the ten probabilities and use the index of the maximum as the predicted digit.
```octave
% probability of each example belonging to each of the 10 classes
predictions_for_each_k = sigmoid(X * all_theta');
% the most probable class per row is the prediction
[k_probability, k_value_predicted] = max(predictions_for_each_k, [], 2);
p = k_value_predicted;
```
The algorithm classified the training set correctly with 94.9% accuracy.
For the neural networks, we were given a pre-trained set of Theta1 and Theta2 that would be used to implement the feedforward propagation algorithm.
We only had to complete the prediction code to calculate the output of the hidden layer and the output layer.
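A minimal sketch of that feedforward step, assuming `X` holds one example per row, `m = size(X, 1)`, and `Theta1`/`Theta2` are the pre-learned weight matrices:

```octave
a1 = [ones(m, 1) X];                       % input layer plus bias unit
a2 = [ones(m, 1) sigmoid(a1 * Theta1')];   % hidden layer activations plus bias unit
a3 = sigmoid(a2 * Theta2');                % output layer: one probability per digit
[~, p] = max(a3, [], 2);                   % predicted digit = index of the max probability
```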
Then, the provided code would pick random samples and predict them using our code, reaching 97.5% accuracy.
## Week 3 - Logistic Regression

This week, we solved two problems. The first was to predict whether a student would be admitted to a certain college, given the results of two admission exams and historic data on the acceptance of other students.
The second problem was to predict whether a microchip at a factory should be accepted or rejected, depending on two tests. In this example, we applied regularization.
The first step to understanding the problem was to visualize the data.
We then had to create a function that computes the sigmoid of a vector or a matrix. Instead of looping over each element, I wrote the vectorized implementation using element-wise division (`./`).
The vectorized implementation of the sigmoid function was one line of code 😎
```octave
g = 1 ./ (1 + exp(-z));
```
To calculate the cost and gradient, I also used a vectorized implementation. The vectorized formulas for the hypothesis, the cost and the gradient are:
Hypothesis for logistic regression (vectorized): $h = g(X\theta) = \dfrac{1}{1 + e^{-X\theta}}$

Cost function (vectorized): $J(\theta) = \dfrac{1}{m}\left(-y^{T}\log(h) - (1 - y)^{T}\log(1 - h)\right)$

Gradient (vectorized): $\nabla_{\theta} J = \dfrac{1}{m}\, X^{T}(h - y)$
The code was also very short as a result:
```octave
h = sigmoid(X * theta);
J = (1/m) * (-y' * log(h) - (1 - y)' * log(1 - h));
grad = (1/m) * X' * (h - y);
```
Given a dataset of samples, we want to compute the predictions of our hypothesis using the theta that gives the lowest cost.
The objective was to return 1 if the prediction was greater than or equal to the threshold of 0.5, or 0 if it was below 0.5.
I managed to do this in Octave in one line of code as well, by computing the vectorized sigmoid hypothesis and comparing it to 0.5.
```octave
p = sigmoid(X * theta) >= 0.5;
```
How the data looks
Mapping the features to a sixth-degree polynomial
The data could not be linearly separated, which meant we had to create a more complex polynomial that could fit the data based on the current features. For this, we mapped the features to all polynomial terms up to the sixth degree.
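This is roughly what the provided feature-mapping helper does (a sketch, not the exact course code):

```octave
function out = mapFeature(X1, X2)
  % map two features to all polynomial terms up to the sixth degree,
  % e.g. x1, x2, x1^2, x1*x2, x2^2, ..., x1*x2^5, x2^6
  degree = 6;
  out = ones(size(X1(:, 1)));                        % bias column of ones
  for i = 1:degree
    for j = 0:i
      out(:, end + 1) = (X1 .^ (i - j)) .* (X2 .^ j);
    end
  end
end
```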
### Cost Function with Regularization

The regularization term is added to the cost function. When calculating it, we exclude the first parameter, theta0, since the bias term should not be regularized.
### Gradient with Regularization

To calculate the gradient with regularization, we compute the regularization term and add it to every gradient component except the one for theta0, where we add nothing.
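Putting both pieces together, and reusing the unregularized code from above, the regularized cost and gradient look roughly like this (a sketch, assuming `lambda` is the regularization parameter):

```octave
h = sigmoid(X * theta);
theta_reg = [0; theta(2:end)];         % zero out theta(1) so the bias term is not regularized

J = (1/m) * (-y' * log(h) - (1 - y)' * log(1 - h)) ...
    + (lambda / (2*m)) * (theta_reg' * theta_reg);

grad = (1/m) * X' * (h - y) + (lambda / m) * theta_reg;
```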
Once regularization was in place, I experimented with lambda being very low (0), 1, and very high (100). The plots below show how the decision boundary behaves in each case and which lambda gives the just-right result. We got an accuracy of 83% on the dataset with lambda 1.
High Variance (Overfitting) λ=0 | Just Right λ=1 | High Bias (Underfitting) λ=100 |
---|---|---|
## Week 2 - Linear Regression with Multiple Variables

This week, we predicted the profit of a food truck company based on data of the profit each food truck makes in different cities and the corresponding city populations.
The mandatory exercises covered gradient descent with one feature, and the optional ones used multiple features.
I first solved gradient descent with one feature using loops: iterating over the sum of the prediction errors, then over the number of features, and finally over the number of iterations that gradient descent runs through.
I also solved it with the vectorized/matrix implementation, which is much quicker.
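A minimal sketch of the vectorized update, assuming `X` already includes the column of ones and that `alpha` and `num_iters` hold the learning rate and the number of iterations:

```octave
for iter = 1:num_iters
  % all theta values are updated simultaneously from the current prediction error
  theta = theta - (alpha / m) * (X' * (X * theta - y));
end
```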
Before that, I had to calculate the cost function which I did using the vectorized method.
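The vectorized cost boils down to something like:

```octave
% squared-error cost, computed without loops
J = (1 / (2 * m)) * sum((X * theta - y) .^ 2);
```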
After 1500 iterations, gradient descent found the values of theta that converge to the minimum of the cost. The corresponding hypothesis, visualized over the data, looks like this.
With these results, we were able to predict the profit for a food truck given a city with a different population.
The next graphs show the surface and contour plots that allow us to visualize the minimum value of thetas that produce the most accurate hypothesis.
Surface | Contour Plot |
---|---|
The exercise is to predict the sale price of a house given two features: its size and the number of bedrooms it has.
The first step was to normalize the features using mean normalization. This brings all features into a similar range (roughly -1 <= xi <= 1) and gives the normalized matrix a mean of 0 and a standard deviation of 1. To do this, I worked with the matrix dimensions directly instead of loops to calculate the normalized matrix.
The normalization formula was: $x_{norm} = \dfrac{x - \mu}{\sigma}$, where $\mu$ is the mean and $\sigma$ the standard deviation of each feature.
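A loop-free sketch of that computation (recent Octave versions broadcast `mu` and `sigma` across the rows automatically; older ones would need `bsxfun` or `repmat`):

```octave
mu = mean(X);                  % per-feature mean
sigma = std(X);                % per-feature standard deviation
X_norm = (X - mu) ./ sigma;    % each column is shifted and scaled by its own mu and sigma
```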
The vectorized/matrix implementations done previously for the cost calculation and the gradient descent also apply to the multiple variables, since we are using the same hypothesis.
To get insight into the best learning rate for the algorithm, I plotted multiple figures, each with a learning rate about 3 times the previous one (a sketch of how this can be done follows the table below). The best learning rate found was about 1, as the algorithm started to diverge at around 1.5.
Different Alpha Rates Tested | Best Learning Rate found |
---|---|
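A hypothetical sketch of how that comparison can be generated (the function name `gradientDescentMulti` and the specific alpha values are assumptions for illustration):

```octave
alphas = [0.01 0.03 0.1 0.3 1];
figure; hold on;
for i = 1:numel(alphas)
  [theta, J_history] = gradientDescentMulti(X, y, zeros(size(X, 2), 1), alphas(i), 50);
  plot(1:numel(J_history), J_history);   % cost should drop faster for a larger (still stable) alpha
end
xlabel('Number of iterations'); ylabel('Cost J');
```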
With the normal equation, we can also find theta without having to choose alpha or iterate as in gradient descent. This works well for a small number of features but becomes slow when n is very large. The normal equation is: $\theta = (X^{T}X)^{-1}X^{T}y$
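In Octave, that is a one-liner (sketch):

```octave
% closed-form solution; pinv handles the case where X'*X is not invertible
theta = pinv(X' * X) * X' * y;
```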