What features distinguish a benign breast tumor from a malignant one? This machine learning experiment attempts to answer that question. My project uses the Wisconsin breast cancer dataset to predict whether an individual's tumor is benign or malignant based on predictor variables; the dataset contains 31 features describing the characteristics of each tumor. The target variable I am predicting is diagnosis, which takes the value M (malignant) or B (benign), making this a classification problem.
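To make the setup concrete, here is a minimal sketch of loading the dataset with pandas; the file name and exact column labels are assumptions, not the project's actual code:

```python
import pandas as pd

# Load the dataset; the file name is an assumption.
df = pd.read_csv("breast_cancer_wisconsin.csv")

print(df.shape)                        # number of rows and columns
print(df["diagnosis"].value_counts())  # counts of M (malignant) and B (benign)
```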
To prepare the data, I checked for null values that could affect the outcome of the experiment. Only one column in the data frame contained nulls, and because that column appeared to be an error rather than meaningful data, I dropped it from the dataset in place. I then re-checked the columns and confirmed that no null values remained. I also observed that all of the remaining columns already held numerical values, so I did not need to replace any with dummy variables. With these checks complete, the dataset was clean and ready for the algorithms.
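A sketch of this cleaning step might look like the following; the name of the erroneous column is an assumption, since the write-up does not identify it:

```python
import pandas as pd

df = pd.read_csv("breast_cancer_wisconsin.csv")

# Count null values per column to find the problem column.
print(df.isnull().sum())

# Drop the erroneous all-null column in place; the column name
# "Unnamed: 32" is an assumption (a common CSV export artifact).
df.drop(columns=["Unnamed: 32"], inplace=True)

# Re-check: confirm no nulls remain anywhere in the frame.
assert df.isnull().sum().sum() == 0
print(df.dtypes)  # verify every remaining column is numeric
```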
Using sklearn, I predicted the target variable, diagnosis, with a random forest classifier. I evaluated the predictions with accuracy, an appropriate metric for a classification task, and displayed a confusion matrix to inspect false positives and false negatives. After tuning parameters and normalizing the data, I recorded each model's accuracy and examined its confusion matrix to determine which algorithm performed best.
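The evaluation could be sketched as below, using the best hyperparameters reported in the next paragraph; the 75/25 train/test split and the random seed are assumptions:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Separate predictors from the target; assumes df has been
# cleaned as described above.
X = df.drop(columns=["diagnosis"])
y = df["diagnosis"]

# The 75/25 split and random_state are assumptions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

forest = RandomForestClassifier(n_estimators=23, max_depth=7, random_state=42)
forest.fit(X_train, y_train)
y_pred = forest.predict(X_test)

print(accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))  # rows = actual, columns = predicted
```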
The random forest with 23 trees and a max depth of 7 produced the best accuracy, 96%. My alternative algorithms were a distance-weighted 5-nearest-neighbors classifier with an accuracy of approximately 90%, an unweighted 16-nearest-neighbors classifier with an accuracy of 91%, and a decision tree with a max depth of 3 with an accuracy of approximately 94%. Normalizing the data improved accuracy for every algorithm, which suggests that the features should contribute to the prediction on equal footing and that all of them were important to include.
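A sketch of the normalized comparison, reusing the split from the earlier sketch, might look as follows; the choice of MinMaxScaler is an assumption, since the write-up does not specify the normalization method:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Scale features to [0, 1]; fit on the training split only so no
# test-set information leaks into the scaler.
scaler = MinMaxScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

# The alternative models described above.
models = {
    "weighted 5-NN": KNeighborsClassifier(n_neighbors=5, weights="distance"),
    "unweighted 16-NN": KNeighborsClassifier(n_neighbors=16),
    "decision tree (depth 3)": DecisionTreeClassifier(max_depth=3, random_state=42),
}
for name, model in models.items():
    model.fit(X_train_s, y_train)
    print(name, model.score(X_test_s, y_test))
```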