Cluster Analysis Example
Ultimately, this analysis looks to answer the question of how movie classes may have changed over the last 35 years, based on movie taglines, and whether the result is consistent with my own observations over that period.
The final model resulted in 6 clusters, named as follows: American_Music, Fear_Evil, Action_Minded, Anything_Fat, Forced_Away, Beyond_Criminal. A chart of standardized text measures is provided at the right. This chart allows for a few conclusions:
1. For the most part there has been consistency over the period, with convergence towards the end.
2. There was a rise into the 2000s of what appear to be health-related reality shows.
3. The last few years show a spike in criminal drama.
In addressing the question of how the classes have changed over the last 35 years, the results are largely neutral rather than the clear conclusion I had expected. Further work on the clustering and feature extraction may help to improve this.
Results of Analysis

The analysis started with a review of the taglines to determine whether additional words could be added to the stoplist to sharpen the meaning of the resulting clusters. A few words were added given that the subject was movies; for example, "actor", "actress", and "movie" did not seem to add much to the goal of creating themes from the clusters. Next, a review of word stemmers, including Porter, Snowball, and Lancaster, was made. It was determined that although they acted very similarly in how they stemmed words, there were subtle differences, basically attributable to how aggressive they were. The Lancaster stemmer seemed the most aggressive and thus was my preference for this analysis. The implementation of word stemming had a considerable effect on the silhouette coefficient, increasing results by upwards of 0.05, and even more substantially as the number of clusters was grown. The downside came in the interpretation of those words, as some of the aggressive stemming produced very low-level root words. With respect to the number of words extracted from the tagline corpus for use in the clustering process, it was found that using more words translated into lower silhouette coefficients for the clusters. While more words provided a richer set to cluster from, the goal was not to use so many words that the depleted silhouette coefficient brought the efficacy of the clustering structure into question. Accordingly, the goal was to not deplete the silhouette coefficient below 0.20. In addition to the extracted words, a variety of pairwise distance metrics were evaluated.
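The stemmer comparison described above can be sketched with NLTK; the word list below is illustrative, not drawn from the actual tagline corpus:

```python
# Compare NLTK's Porter, Snowball, and Lancaster stemmers on a few
# movie-flavored words; Lancaster tends to produce the shortest,
# most aggressive root forms, which can hurt interpretability.
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer

words = ["running", "criminal", "romantic", "generously"]
stemmers = {
    "porter": PorterStemmer(),
    "snowball": SnowballStemmer("english"),
    "lancaster": LancasterStemmer(),
}
for name, stemmer in stemmers.items():
    print(name, [stemmer.stem(w) for w in words])
```

Inspecting the printed stems side by side is usually enough to see which stemmer collapses the most word forms onto a single root.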
Again, subtle differences were identified, as with the stemmers; my preference from that evaluation was the Euclidean distance, and accordingly that was used for the modeling. Finally, a comparison between the K-means and MiniBatch K-means models was performed. Again, the differences found between these two algorithms were only subtle. One thing noticed during the evaluation was that MiniBatch K-means ran faster than K-means. For this model, performance was not an issue, but if the size of the corpus were increased substantially then MiniBatch K-means may be preferable due to its speed of execution. The table here provides a summary of the silhouette coefficient under the 2 models using stemmed and unstemmed data for various cluster sizes. It was predetermined that more than 8 clusters may be unmanageable.
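A minimal sketch of that model comparison, using synthetic blob data in place of the tagline matrix, might look like the following:

```python
# Compare K-means and MiniBatch K-means fitted to the same data:
# silhouette coefficients are typically close, while MiniBatch K-means
# scales better because it updates centroids from random subsamples.
import time

from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=2000, centers=6, n_features=20, random_state=0)
for Model in (KMeans, MiniBatchKMeans):
    start = time.perf_counter()
    labels = Model(n_clusters=6, n_init=10, random_state=0).fit_predict(X)
    elapsed = time.perf_counter() - start
    score = silhouette_score(X, labels)
    print(f"{Model.__name__}: silhouette={score:.3f} time={elapsed:.3f}s")
```

On a small input like this the timing gap is modest; it widens as the number of documents grows, which is the scenario where MiniBatch K-means would pay off.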
From this analysis it was decided to use the Lancaster stemmer, the K-means algorithm, and 250 extracted words with 6 clusters.
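That chosen configuration could be sketched roughly as below; the tokenizer, the absence of the custom stoplist, and the toy documents are stand-ins for the author's actual corpus and code:

```python
# Final configuration sketch: Lancaster-stemmed tokens, a TF-IDF matrix
# capped at 250 extracted words, and K-means with 6 clusters.
import re

from nltk.stem import LancasterStemmer
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

stemmer = LancasterStemmer()

def stem_tokens(text):
    # Lowercase, split on runs of letters, and stem each token.
    return [stemmer.stem(t) for t in re.findall(r"[a-z]+", text.lower())]

docs = [  # toy stand-in documents, not the real movie taglines
    "a criminal drama about police and a detective",
    "a detective hunts a criminal in a police drama",
    "a romantic story of love and family",
    "love and romance at a family wedding",
    "robots and aliens battle in deep space",
    "an alien spaceship and a robot in a distant galaxy",
] * 2  # repeat so each cluster has a few members

vectorizer = TfidfVectorizer(tokenizer=stem_tokens, token_pattern=None,
                             max_features=250)
X = vectorizer.fit_transform(docs)
labels = KMeans(n_clusters=6, n_init=10, random_state=0).fit_predict(X)
print(labels)
```

With the real corpus, the 250-word cap keeps the silhouette coefficient from being depleted by a long tail of rare terms, per the 0.20 floor discussed above.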