Machine Learning Project 2 - K Nearest Neighbors Methods
Description: For this project, I worked in a team with two other students at Montana State University. We were tasked with coding five models from scratch: a KNN classification model, a KNN regression model, an edited KNN classification model, an edited KNN regression model, and a K-Means clustering model. We then applied these models to six datasets: the Breast Cancer Wisconsin, Glass Identification, Soybean (Small), Abalone, Forest Fires, and Computer Hardware datasets. For the classification datasets, we first built a base KNN classification model. We set aside a hold-out fold of the data and used it to tune the model's hyperparameters. Once the model was tuned, we ran 10-fold cross-validation on the remaining data, with no removal of noisy data. We then repeated the experiment with an edited KNN classification model, which removes the data points that it misclassifies and so ends up removing noise. The regression datasets followed a similar process; the only difference in the model was that a kernel weighted the k closest neighbors by distance, instead of the prediction simply being their mean. For the edited KNN regression model, we also had to tune a threshold on how close a prediction had to be to the actual value, which decided whether the model removed the data point. Finally, we ran the experiment one more time, but instead of using edited KNN models to produce the edited dataset, we used the centroids from a K-Means clustering model. Once these experiments were complete, we analyzed and compared the results from all the models.
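To make the description above more concrete, here is a minimal sketch of the three core ideas: majority-vote KNN classification, kernel-weighted KNN regression, and editing a training set by removing points the model gets wrong. It is an illustration only and not the project's code. It uses only the Python standard library so it runs without dependencies. The function names, the Gaussian kernel, the `bandwidth` parameter, the single-pass editing order, and the toy data are all assumptions made for this example.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_nearest(train, query, k):
    """Return the k (distance, target) pairs closest to query.

    train is a list of (features, target) tuples.
    """
    pairs = [(euclidean(x, query), y) for x, y in train]
    pairs.sort(key=lambda p: p[0])  # sort on distance only
    return pairs[:k]


def knn_classify(train, query, k):
    """Predict the majority class among the k nearest neighbors."""
    votes = Counter(y for _, y in k_nearest(train, query, k))
    return votes.most_common(1)[0][0]


def knn_regress(train, query, k, bandwidth=1.0):
    """Predict a kernel-weighted average of the k nearest targets.

    Assumption: a Gaussian kernel; the project may have used another.
    """
    neighbors = k_nearest(train, query, k)
    weights = [math.exp(-(d ** 2) / (2 * bandwidth ** 2)) for d, _ in neighbors]
    total = sum(weights)
    if total == 0.0:  # every weight underflowed: fall back to the plain mean
        return sum(y for _, y in neighbors) / len(neighbors)
    return sum(w * y for w, (_, y) in zip(weights, neighbors)) / total


def edit_training_set(train, k, is_wrong):
    """Single pass of editing: drop each point the remaining points get wrong.

    is_wrong(others, point) returns True when the model built on `others`
    mispredicts `point`. Assumption: one pass, in the order given.
    """
    kept = list(train)
    for point in train:
        others = [p for p in kept if p is not point]
        if len(others) < k:
            break  # too few points left to form k neighbors
        if is_wrong(others, point):
            kept = others
    return kept


# Toy classification data: two clusters plus one mislabeled (noisy) point.
train = [((0.0, 0.0), "a"), ((0.1, 0.0), "a"), ((0.0, 0.1), "a"),
         ((1.0, 1.0), "b"), ((1.1, 1.0), "b"), ((1.0, 1.1), "b"),
         ((0.05, 0.05), "b")]  # sits inside cluster "a" but is labeled "b"

edited = edit_training_set(
    train, k=3,
    is_wrong=lambda others, p: knn_classify(others, p[0], 3) != p[1],
)

# Toy regression data on the line y = x.
train_r = [((0.0,), 0.0), ((1.0,), 1.0), ((2.0,), 2.0)]
prediction = knn_regress(train_r, (1.0,), k=3)
```

For the regression case, the same `edit_training_set` function could be reused by passing an `is_wrong` that compares the absolute prediction error against a tuned threshold, for example `lambda others, p: abs(knn_regress(others, p[0], 3) - p[1]) > threshold`, where `threshold` is a hypothetical name for the tuned value mentioned above.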
Results: Many of the K Nearest Neighbors methods performed better on classification data than on regression data. The performance of individual models depended on the structure of each dataset. Whether the K-Means model or the edited model performed better depended on how many outliers the dataset contained.
Technologies: Python, NumPy, Matplotlib, UML, LaTeX
Note: If you would like to see the full design document, code base, and research paper that accompany this project, please feel free to reach out to me by email.