Hello Readers,
Today we continue the text mining series with a post on k-medoids clustering in R. For this analysis I queried 200 recent tweets (May 3rd) using the hashtag #Ukraine, given the recent escalation between Ukrainian and pro-Russian forces in eastern Ukrainian cities.
You can catch up on the recent Ukraine crisis in my previous post illustrating events and palladium prices. There we saw that international economic sanctions, which threaten the supply of the rare metal from Russia, were reflected in the price. But what has been the online Twitter reaction to the escalating conflict in the region? To assess online sentiment, we can look at tweets containing the #Ukraine hashtag.
Tweet Retrieval
For those of you who have difficulty using the twitteR package in R, remember to change the URLs from http:// to https://. I had saved my OAuth object, with the handshake already completed, from earlier posts, so I simply registered the credential object again.
Because this post focuses on k-medoids clustering, I will only post the R code here, as my other text mining posts explain the tweet/timeline retrieval process.
Code:
require(twitteR)
require(ROAuth)

# load twitter cred
load("cred.Rdata")
registerTwitterOAuth(cred)

# search twitter for ukraine tweets
ukraine <- searchTwitter("#Ukraine", n=200, lang="en", cainfo="cacert.pem")

# tm ukraine data ####
library(tm)

# to data.frame
ukraine.df <- do.call("rbind", lapply(ukraine, as.data.frame))
# to corpus
ukraine.corpus <- Corpus(VectorSource(ukraine.df$text))

# begin transformations
ukraine.corpus <- tm_map(ukraine.corpus, tolower)
ukraine.corpus <- tm_map(ukraine.corpus, removePunctuation)
ukraine.corpus <- tm_map(ukraine.corpus, removeNumbers)
removeURLs <- function(x) gsub("http[[:alnum:]]*", "", x)
ukraine.corpus <- tm_map(ukraine.corpus, removeURLs)
myStopWords <- c(stopwords('english'), "via", "rt")
ukraine.corpus <- tm_map(ukraine.corpus, removeWords, myStopWords)

# stem and replacement
ukraine.corpus.c <- ukraine.corpus
ukraine.corpus <- tm_map(ukraine.corpus, stemDocument)
ukraine.corpus <- tm_map(ukraine.corpus, stemCompletion, dictionary=ukraine.corpus.c)

# to tdm
ukraine.tdm <- TermDocumentMatrix(ukraine.corpus, control=list(wordLengths=c(1, Inf)))

# remove sparse terms
ukraine.tdm <- removeSparseTerms(ukraine.tdm, sparse=0.95)
The remaining terms, those not removed for sparsity, each appear in more than 5% of the documents:
[Figure: Terms]
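To make the sparse=0.95 threshold concrete, here is a minimal base R sketch on a made-up term-document matrix. The matrix toy.m and its counts are hypothetical, and the filter only mimics the rule described above (keep terms that appear in more than 5% of documents) rather than calling removeSparseTerms() itself.

```r
# Hypothetical term-document matrix: rows are terms, columns are 40 documents
toy.m <- matrix(0, nrow = 3, ncol = 40,
                dimnames = list(c("ukraine", "crimea", "rare"),
                                paste0("doc", 1:40)))
toy.m["ukraine", ] <- 1       # appears in every document
toy.m["crimea", 1:10] <- 1    # appears in 25% of documents
toy.m["rare", 1] <- 1         # appears in 2.5% of documents

# share of documents in which each term occurs at least once
doc.share <- rowSums(toy.m > 0) / ncol(toy.m)

# keep the terms that occur in more than 5% of documents (sparse = 0.95)
kept <- names(doc.share)[doc.share > 1 - 0.95]
print(kept)
```

In this toy case "rare" falls under the 5% mark and is dropped, while the other two terms survive.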
Code:
# wordcloud
library(wordcloud)
ukraine.m <- as.matrix(ukraine.tdm)
wordFreqSort <- sort(rowSums(ukraine.m), decreasing=TRUE)
grayLevels <- gray( (wordFreqSort + 10) / (max(wordFreqSort) + 10))
word.cloud <- wordcloud(words=names(wordFreqSort), freq=wordFreqSort,
                        min.freq=3, random.order=FALSE, colors=grayLevels,
                        scale=c(4.5,.9), rot.per=.1)
[Figure: #Ukraine Word Cloud]
K-Medoids Clustering
Now that we have an idea of the terms in the #Ukraine tweets, both from the list and visually from the word cloud, we would like to determine how terms are associated with each other using k-medoids clustering. Which terms are closest to each other, such that they can be partitioned as clusters of terms?
To do so, we use the pamk() function from the fpc package, which selects the optimal number of clusters by average silhouette width (higher is better). pamk() is a wrapper for the pam() function, which is more robust than k-means clustering. PAM stands for "Partitioning Around Medoids". K-medoids uses actual data points as the centers of clusters. It is more robust because it minimizes a sum of pairwise dissimilarities rather than the sum of squared Euclidean distances, which is vulnerable to outliers.
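As a quick illustration of why a medoid is less sensitive to outliers than a mean, here is a minimal base R sketch on made-up numbers. The vector x is hypothetical, and the medoid is found by brute force (the point with the smallest total Manhattan distance to all others), which is not how pam() searches but gives the same notion of a center for a single cluster.

```r
# Hypothetical one-dimensional data with a single outlier
x <- c(1, 2, 3, 4, 100)

# pairwise Manhattan distances between all points
d <- as.matrix(dist(x, method = "manhattan"))

# medoid: the actual data point with the smallest total distance to the rest
medoid <- x[which.min(rowSums(d))]

# centroid as used by k-means: the arithmetic mean
centroid <- mean(x)

cat("medoid:", medoid, " mean:", centroid, "\n")
```

The outlier drags the mean far away from the bulk of the data, while the medoid stays on one of the central points.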
Code:
library(fpc)

# create a transposed matrix from tdm (rows are tweets, columns are terms)
ukraine.m <- as.matrix(ukraine.tdm)
ukraine.m2 <- t(ukraine.m)

# pamk clustering
ukraine.pam <- pamk(ukraine.m2, metric="manhattan")

# optimal cluster number
ukraine.pam$nc
# [1] 2

# isolate cluster object
ukraine.obj <- ukraine.pam$pamobject

# print out terms in each cluster (those with a count of 1 in the cluster's medoid)
for(i in 1:ukraine.pam$nc) {
  cat(paste("cluster ", i, ": ", sep=""))
  cat(colnames(ukraine.obj$medoids)[which(ukraine.obj$medoids[i,]==1)], "\n")
}
# cluster 1: ukraine
# cluster 2: crimea days dead dozens mourning observes rtcom southeastern ukraine

# plot clusters and silhouette
plot(ukraine.obj)
From the average silhouette width, we see that the optimal cluster number is 2. Keeping that in mind, we want to see which terms are in the two clusters. A for loop prints them, reading the terms off each cluster's medoid: cluster 1 contains "ukraine", and cluster 2 contains the other terms, "crimea", "days", "dead", "dozens", "mourning", and so on.
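The idea behind picking the cluster number can be sketched by hand: run pam() for several values of k and keep the one with the highest average silhouette width. The sketch below calls pam() from the cluster package directly on made-up 2-D points. The names toy, k.range and best.k are illustrative, and this only approximates what pamk() does rather than reproducing its implementation.

```r
library(cluster)

# Hypothetical data: two well-separated groups of five points each
toy <- rbind(
  cbind(c(0, 1, 0, 1, 0.5),      c(0, 0, 1, 1, 0.5)),
  cbind(c(10, 11, 10, 11, 10.5), c(10, 10, 11, 11, 10.5))
)

# candidate numbers of clusters to compare
k.range <- 2:4

# average silhouette width of the PAM solution for each k
avg.width <- sapply(k.range, function(k) {
  pam(toy, k, metric = "manhattan")$silinfo$avg.width
})

# the k with the highest average silhouette width
best.k <- k.range[which.max(avg.width)]
print(data.frame(k = k.range, avg.width = avg.width))
print(best.k)
```

With two clearly separated groups, splitting either group further should lower the average width, so the loop settles on two clusters.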
Cluster Plot:
Our results from 199 tweets are not fantastic, but the two components explain ~37.57% of the variability. You can observe the clusters below; only one is distinct.
Silhouette Plot:
A silhouette width close to 1 indicates a good fit within the cluster (points close together), a value close to zero means the point borders two clusters, and a negative value indicates the point is most likely misclassified. The first cluster has a decent average width of 0.65 for "ukraine", while cluster two has a very high average width of 0.98 for its 9 terms.
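For readers who want to see where these widths come from, here is a minimal base R sketch that computes silhouette widths by hand. For each point, a is its mean distance to the other points in its own cluster, b is its mean distance to the nearest other cluster, and the width is (b - a) / max(a, b). The data x and the cluster labels cl are made up for illustration.

```r
# Hypothetical one-dimensional data and a fixed two-cluster assignment
x  <- c(1, 2, 3, 10, 11, 12)
cl <- c(1, 1, 1, 2, 2, 2)

# pairwise Manhattan distances
d <- as.matrix(dist(x, method = "manhattan"))

sil <- sapply(seq_along(x), function(i) {
  own <- cl == cl[i]
  own[i] <- FALSE                      # exclude the point itself
  a <- mean(d[i, own])                 # mean distance within its own cluster
  others <- setdiff(unique(cl), cl[i])
  # mean distance to the closest other cluster
  b <- min(sapply(others, function(k) mean(d[i, cl == k])))
  (b - a) / max(a, b)
})

print(round(sil, 3))
print(mean(sil))  # average silhouette width over all points
```

Because the two groups are tight and far apart, every width lands near 1, which is the situation described above for a well-fitting cluster.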
What do these results mean? Judging from the terms in the clusters, "ukraine" was used by itself in tweets most of the time. However, some of those tweets included more detail about the region, such as "prorussian" forces invading "crimea" or "southeastern" "ukraine". Those tweets tended to be similar to each other. So from the clusters, we determine that tweets with #Ukraine are similar in that they include either only "ukraine", or "ukraine" along with other comparable terms describing the crisis.
Stay tuned for more text analysis!
Thanks for reading,
Wayne
@beyondvalence