K Nearest Neighbors


Introduction

KNN, or K-Nearest Neighbors, is a supervised machine learning model that is most frequently used for classification. The model computes the distance between a data point and the other data points, then assigns the point to a class based on the classes of its nearest neighbors. As a supervised model, the correct classification of each data point is known when training begins.

[Figure: KNN illustration]

Initially, this dataset had a target variable defined as integers. This would work for a regression, but less so for a model that excels at classification (although KNN can be used for regression). To aid in classification, our group chose to split the data at specific percentiles: the 33rd and 66th percentiles for three categories, and the 20th/40th/60th/80th percentiles for five categories. The analysis was done on both the 3-Category and 5-Category datasets, and each set brought its own challenges.
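A minimal sketch of that percentile binning, assuming pandas and a hypothetical target column named shares (the real column name depends on the dataset):

```python
import pandas as pd

# Hypothetical target column; the real column name depends on the dataset.
df = pd.DataFrame({"shares": [120, 340, 560, 890, 1200, 2500, 4000, 9000, 15000]})

# 3-Category split at the 33rd and 66th percentiles -> labels 0, 1, 2
df["target_3cat"] = pd.qcut(df["shares"], q=[0, 0.33, 0.66, 1.0], labels=[0, 1, 2])

# 5-Category split at the 20th/40th/60th/80th percentiles -> labels 0..4
df["target_5cat"] = pd.qcut(df["shares"], q=5, labels=[0, 1, 2, 3, 4])
```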

As part of implementing the KNN model, the data was scaled. Some of the data contained negative values, and some features were an order of magnitude or two larger than others. For every parameter to carry equal weight in our model, scaling all parameters was necessary. After the data was prepped for our KNN model, the target data was stored in a single column with values of [0, 1, 2] or [0, 1, 2, 3, 4] for the 3-Category and 5-Category datasets respectively. From that column, a one-hot encoded version of the output classes was also created.
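A sketch of that preparation, assuming scikit-learn, with random placeholder data standing in for the real feature matrix:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Placeholder features with mixed signs and magnitudes, standing in for the real data.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10)) * [1, 10, 100, 1, 1, 1, 1, 1, 1, 1]
y = rng.integers(0, 3, size=500)  # 3-Category labels: 0, 1, 2

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the scaler on the training split only, then apply it to both splits,
# so every feature contributes equally to the distance calculation.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# One-hot version of the integer labels, for the one-hot variant of the model.
y_train_onehot = np.eye(3, dtype=int)[y_train]
```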

Model Comparison

The same general approach was taken when evaluating the 3-Category and 5-Category datasets: one-hot encoding the output classes vs. not one-hot encoding them. The two models whose output classes were not one-hot encoded performed marginally well, with accuracies above the threshold of random chance: 47.0% vs. 33% for three categories, and 29.4% vs. 20% for five.
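One way the two variants could be set up with scikit-learn is sketched below, again on placeholder data and with an illustrative (not tuned) k of 15. Note that scikit-learn treats a 2D one-hot target as a multilabel problem, which changes how accuracy is scored:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))    # placeholder feature matrix
y = rng.integers(0, 3, size=500)  # placeholder 3-Category labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Variant 1: plain integer labels -- ordinary multiclass KNN.
knn = KNeighborsClassifier(n_neighbors=15).fit(X_train, y_train)
acc_plain = accuracy_score(y_test, knn.predict(X_test))

# Variant 2: one-hot labels -- scikit-learn treats a 2D binary target as a
# multilabel problem, and accuracy_score then demands an exact match across
# all indicator columns, a stricter criterion than plain multiclass accuracy.
knn_oh = KNeighborsClassifier(n_neighbors=15).fit(X_train, np.eye(3, dtype=int)[y_train])
acc_oh = accuracy_score(np.eye(3, dtype=int)[y_test], knn_oh.predict(X_test))

print(f"plain labels: {acc_plain:.3f}  |  one-hot labels: {acc_oh:.3f}")
```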

[Figures: accuracy plots for the 3-Category and 5-Category models without one-hot encoding]

Compared to the models above, randomly choosing a category of article success was more accurate than the one-hot encoded models. The 3-Category accuracy converged around 25%, where the 5-Category accuracy converged just above 1%. Whilst neither model crosses the threshold of random chance, the 3-Category model was moderately more accurate than the 5-Category model.

[Figures: accuracy plots for the 3-Category and 5-Category models with one-hot encoding]

Final Thoughts

In the end, the models in which one-hot encoding was not deployed yielded better accuracy. With enough neighbors, the accuracy was able to surpass the chance thresholds of 33% and 20% for the two datasets.
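A sketch of that kind of neighbor sweep, assuming scikit-learn and placeholder data:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 10))    # placeholder for the scaled feature matrix
y = rng.integers(0, 3, size=500)  # placeholder 3-Category labels

# Accuracy as a function of the number of neighbors; on the real data this
# is where the non-one-hot models eventually climbed past random chance.
for k in [1, 5, 15, 25, 51, 101]:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    print(f"k={k:>3}: mean CV accuracy = {scores.mean():.3f}")
```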

In the Random Forest analysis, a feature importance analysis was done, and it turned out that there were no clear features that could easily predict article popularity. The data is also very heavily skewed right, which resulted in our percentile cutoffs being quite close together; ultimately, the small differences between data points were not immediately obvious to the model, i.e., very hard to differentiate. This is the predominant theory as to why the one-hot encoded models performed so poorly.
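To illustrate the skew problem, here is a small demonstration using a lognormal distribution as a stand-in for the real target distribution:

```python
import numpy as np

rng = np.random.default_rng(2)
shares = rng.lognormal(mean=7, sigma=1.5, size=10_000)  # heavy right skew, placeholder

# The 20th/40th/60th/80th percentile cut points sit close together relative
# to the full range, so adjacent categories differ by small absolute amounts.
print(np.percentile(shares, [20, 40, 60, 80]).round(0))
print(round(shares.max()))
```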

As expected, the 3-Category model was a bit more accurate than the 5-Category model. We speculate that this is because, with fewer and wider categories, the model can differentiate the distances between neighboring classes more easily.