This project uses machine learning to predict the popularity of articles on mashable.com. It is the final project for the University of Minnesota's Data Visualization and Analytics Cohort 10. The group members are Steven Gaetz, Natalia Mendoza-Orr, Nate Witte, and Sam Ziegler. The group used K Nearest Neighbors (KNN), Decision Tree, Random Forest, Support Vector Machine (SVM), and Neural Network models to predict which of three categories an article falls into, based on its number of shares on social media. The categories are Popular, Neutral, and Unpopular.
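The exact share thresholds that separate the three categories are not stated here. As a minimal sketch, assuming the cutoffs are the terciles of the share counts (an assumption, not necessarily what the group did), the labeling step could look like this. The function name `label_by_shares` and the `cutoffs` parameter are hypothetical.

```python
from statistics import quantiles


def label_by_shares(shares, cutoffs=None):
    """Map each share count to 'Unpopular', 'Neutral', or 'Popular'.

    cutoffs is an optional (low, high) pair. When it is omitted, the
    terciles of `shares` are used. Tercile cutoffs are an assumption made
    for illustration; the project's real thresholds are not documented here.
    """
    if cutoffs is None:
        # quantiles(..., n=3) returns the two cut points between three groups
        low, high = quantiles(shares, n=3)
    else:
        low, high = cutoffs

    labels = []
    for count in shares:
        if count <= low:
            labels.append("Unpopular")
        elif count <= high:
            labels.append("Neutral")
        else:
            labels.append("Popular")
    return labels


example_shares = [100, 200, 300, 400, 500, 600]
print(label_by_shares(example_shares))
```

Passing explicit `cutoffs` makes it easy to swap in whatever thresholds the project actually used.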
The dataset contains a set of features about articles published by Mashable.com over a period of two years. It has 39,797 rows and 61 attributes (58 predictive attributes, 2 non-predictive attributes, and 1 target field), including the URL of the article, the number of days between article publication and dataset acquisition, and the number of words in the title. For the full list of features, and to download the dataset used in this project, see: https://archive.ics.uci.edu/ml/datasets/Online+News+Popularity.
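The split into predictive attributes, non-predictive attributes, and target can be sketched as follows. This is an illustration only: the column names `url`, `timedelta`, and `shares` are assumed from the attribute descriptions above, `load_articles` is a hypothetical helper, and the two-line sample stands in for the real CSV file.

```python
import csv
import io

# Assumed column names for the 2 non-predictive attributes and the target.
NON_PREDICTIVE = ("url", "timedelta")
TARGET = "shares"


def load_articles(handle):
    """Parse a CSV file object into (features, targets).

    features is a list of dicts holding only the predictive attributes,
    targets is the list of share counts.
    """
    # skipinitialspace handles headers and values written as ", name"
    reader = csv.DictReader(handle, skipinitialspace=True)
    features, targets = [], []
    for row in reader:
        row = {key.strip(): value for key, value in row.items()}
        targets.append(int(float(row.pop(TARGET))))
        for name in NON_PREDICTIVE:
            row.pop(name, None)  # drop columns that carry no predictive signal
        features.append({key: float(value) for key, value in row.items()})
    return features, targets


# Tiny in-memory stand-in for the downloaded file (made-up values).
sample = io.StringIO(
    "url, timedelta, n_tokens_title, shares\n"
    "http://example.com/a, 731.0, 12.0, 593\n"
)
features, targets = load_articles(sample)
print(features, targets)
```

With the real file, the same function would be called on an opened file handle instead of the `io.StringIO` sample.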
This project was created with:
The results of the project are shown below:
Algorithm | Accuracy (%) |
---|---|
Decision Tree | 42.0 |
Random Forest | 50.9 |
SVM | 32.0 |
KNN | 47.0 |
Neural Network | 47.3 |
As shown above, the Random Forest algorithm had the highest accuracy of all the models. However, none of the models was particularly effective: the best accuracy was just over 50% on a three-class problem. This leads to our primary conclusion that the attributes collected in this dataset have little predictive power for the number of shares.
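One way to put an accuracy figure in context is to compare it with a baseline that always predicts the most common class. The sketch below is not part of the original analysis; the helper names `accuracy` and `majority_baseline` are hypothetical, and the labels are made-up examples rather than project data.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for truth, guess in zip(y_true, y_pred) if truth == guess)
    return correct / len(y_true)


def majority_baseline(y_true):
    """Accuracy obtained by always predicting the most frequent class."""
    most_common_count = Counter(y_true).most_common(1)[0][1]
    return most_common_count / len(y_true)


# Made-up labels for illustration only.
y_true = ["Popular", "Neutral", "Unpopular", "Popular"]
y_pred = ["Popular", "Popular", "Unpopular", "Neutral"]

print(accuracy(y_true, y_pred))
print(majority_baseline(y_true))
```

A model is only adding information when its accuracy is clearly above the baseline for the same labels.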