
Sunday, May 4, 2014

Text Mining: 6. K-Medoids Clustering of #Ukraine Tweets in R


Hello Readers,

Today we continue the text mining series with a post on k-medoids clustering in R. For this analysis I queried 200 recent tweets (May 3rd) with the hashtag #Ukraine, given the recent escalation of fighting between Ukrainian and pro-Russian forces in eastern Ukrainian cities.

You can catch up on the recent Ukraine crisis in my previous post illustrating events and palladium prices. There we saw that international economic sanctions threatening the supply of the rare metal from Russia were reflected in its price. But what has the Twitter reaction been to the growing conflict in the region? To assess online sentiment, we can look to tweets containing the #Ukraine hashtag.


Tweet Retrieval


For those of you who have difficulty using the twitteR package in R, remember to change the URLs from http:// to https://. I had saved my OAuth object, with the handshake already completed, from earlier posts, so I simply registered the credential object again.

Because this post focuses on k-medoids clustering, I will only post the R code here; my other text mining posts explain the tweet/timeline retrieval process.

Code:
require(twitteR)
require(ROAuth)

# load twitter cred
load("cred.Rdata")
registerTwitterOAuth(cred)

# search twitter for ukraine tweets
ukraine <- searchTwitter("#Ukraine", n=200, lang="en", cainfo="cacert.pem")

# tm ukraine data ####
library(tm)
# to data.frame
ukraine.df <- do.call("rbind", lapply(ukraine, as.data.frame))
# to corpus
ukraine.corpus <- Corpus(VectorSource(ukraine.df$text))
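# begin transformations
# content_transformer() keeps the corpus structure intact (needed in tm >= 0.6)
ukraine.corpus <- tm_map(ukraine.corpus, content_transformer(tolower))
ukraine.corpus <- tm_map(ukraine.corpus, removePunctuation)
ukraine.corpus <- tm_map(ukraine.corpus, removeNumbers)
# punctuation is already gone, so a URL now looks like "httptcoabc"
removeURLs <- function(x) gsub("http[[:alnum:]]*", "", x)
ukraine.corpus <- tm_map(ukraine.corpus, content_transformer(removeURLs))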
myStopWords <- c(stopwords('english'), "via", "rt")
ukraine.corpus <- tm_map(ukraine.corpus, removeWords, myStopWords)
# stem and replacement
ukraine.corpus.c <- ukraine.corpus
ukraine.corpus <- tm_map(ukraine.corpus, stemDocument)
ukraine.corpus <- tm_map(ukraine.corpus, stemCompletion,
                         dictionary=ukraine.corpus.c)
# to tdm
ukraine.tdm <- TermDocumentMatrix(ukraine.corpus, 
                                  control=list(wordLengths=c(1, Inf)))
# remove sparse terms
ukraine.tdm <- removeSparseTerms(ukraine.tdm, sparse=0.95)

The terms that survive the sparsity filter each appear in more than 5% of the tweets:



Terms
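To make the sparsity threshold concrete, here is a minimal sketch in base R on a made-up term-document matrix. The terms, counts, and the 0.7 threshold are hypothetical and are not the tweet data. It mirrors my understanding of removeSparseTerms(): a term is kept when the share of documents containing it exceeds 1 - sparse.

```r
# Toy term-document matrix: rows are terms, columns are documents.
# All values here are invented for illustration.
tdm <- matrix(c(1, 1, 1, 1,
                1, 0, 0, 0,
                0, 2, 1, 0),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("ukraine", "odessa", "russia"),
                              paste0("doc", 1:4)))

# Share of documents in which each term occurs at least once
docShare <- rowMeans(tdm > 0)

# Keep terms whose document share exceeds 1 - sparse (assumed rule)
sparse <- 0.7
kept <- tdm[docShare > 1 - sparse, , drop = FALSE]

print(docShare)
print(rownames(kept))
```

With sparse=0.95, as in the post, the same rule puts the cutoff at 5% of the tweets.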
Very quickly we can create a word cloud from the terms using the wordcloud package:

Code:
# wordcloud
library(wordcloud)
ukraine.m <- as.matrix(ukraine.tdm)
wordFreqSort <- sort(rowSums(ukraine.m), decreasing=TRUE)
grayLevels <- gray( (wordFreqSort + 10) / (max(wordFreqSort) + 10))
word.cloud <- wordcloud(words=names(wordFreqSort), freq=wordFreqSort, 
                        min.freq=3, random.order=FALSE, colors=grayLevels,
                        scale=c(4.5,.9), rot.per=.1)

#Ukraine Word Cloud
Aside from "ukraine", the larger fonts in the word cloud pick out major terms such as "russia", "odessa", "prorussian", "crimea", and others.
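As a small aside on the colors, the sketch below reruns the gray-level formula from the word cloud code on three invented frequencies. It shows that the most frequent term gets gray level 1, which gray() renders as white, so on a white background that term may be hard to see. The reversed scale (darkLevels) is a hypothetical alternative and was not used for the figure above.

```r
# Invented frequencies, sorted the same way as in the post
wordFreq <- c(crimea = 4, ukraine = 20, russia = 9)
wordFreqSort <- sort(wordFreq, decreasing = TRUE)

# Formula from the post: the top term maps to gray(1), i.e. white
grayLevels <- gray((wordFreqSort + 10) / (max(wordFreqSort) + 10))

# Hypothetical alternative: frequent terms darker, rare terms lighter
darkLevels <- gray(0.8 * (1 - wordFreqSort / max(wordFreqSort)))

print(data.frame(freq = wordFreqSort, post = grayLevels, reversed = darkLevels))
```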



K-Medoids Clustering


Now that we have an idea of the terms in the #Ukraine tweets, both from the term list and visually from the word cloud, we would like to see how the tweets group together by the terms they contain, using k-medoids clustering. Which tweets are closest to each other, such that they can be partitioned into clusters, and which terms characterize each cluster?


To do so, we use the pamk() function from the fpc package, which picks the optimal number of clusters by average silhouette width (higher = better). pamk() is a wrapper for the pam() function from the cluster package. PAM stands for "Partitioning Around Medoids". K-medoids uses actual data points (the medoids) as cluster centers. It is more robust than k-means because it minimizes a sum of pairwise dissimilarities rather than a sum of squared Euclidean distances, which is vulnerable to outliers.
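The robustness claim can be illustrated with a tiny sketch, assuming the cluster package is installed. The five values are invented, with one deliberate outlier. With a single cluster, the k-means center is simply the mean, while pam() has to pick one of the observed points as the medoid.

```r
library(cluster)

# Invented one-dimensional data with an outlier at 100
x <- matrix(c(1, 2, 3, 4, 100), ncol = 1)

# One cluster: the medoid is the observation with the smallest
# total dissimilarity to all other observations
fit <- pam(x, k = 1)
medoidValue <- x[fit$id.med, 1]

# The k-means center for one cluster is the mean of the data
meanValue <- mean(x)

cat("medoid:", medoidValue, " mean:", meanValue, "\n")
```

The medoid stays inside the bulk of the data, while the mean is dragged toward the outlier.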


library(fpc)

# create a transposed matrix from tdm
ukraine.m <- as.matrix(ukraine.tdm)
ukraine.m2 <- t(ukraine.m)

# pamk clustering
ukraine.pam <- pamk(ukraine.m2, metric="manhattan")
# optimal cluster number
ukraine.pam$nc
# [1] 2
# isolate cluster object
ukraine.obj <- ukraine.pam$pamobject
# print the terms present in each cluster's medoid
for(i in 1:ukraine.pam$nc) {
  cat(paste("cluster ", i, ": ", sep=""))
  cat(colnames(ukraine.obj$medoids)[which(ukraine.obj$medoids[i,]==1)], "\n")
}
# cluster 1: ukraine 
# cluster 2: crimea days dead dozens mourning observes rtcom southeastern ukraine 

# plot clusters and silhouette
plot(ukraine.obj)

The average silhouette width puts the optimal cluster number at 2. Keeping that in mind, we want to see which terms characterize the two clusters. A for loop prints the terms present in each cluster's medoid: for cluster 1 the medoid contains only "ukraine", and for cluster 2 it contains "ukraine" plus other terms such as "crimea", "days", "dead", "dozens", and "mourning".
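To show what the loop is reading, here is a minimal sketch on an invented document-term matrix of six tweets, using pam() from the cluster package directly rather than pamk(). The tweets and terms are hypothetical. Each row of the medoids matrix is one actual tweet, and the nonzero columns are the terms that tweet contains. The sketch tests for counts above zero, so a term repeated within a tweet is also listed.

```r
library(cluster)

# Invented document-term matrix: rows are tweets, columns are terms
terms <- c("crimea", "dead", "mourning", "russia", "ukraine")
dtm <- matrix(c(0, 0, 0, 0, 1,
                0, 0, 0, 0, 1,
                0, 0, 0, 1, 1,
                1, 1, 0, 0, 1,
                1, 1, 0, 0, 1,
                1, 1, 1, 0, 1),
              ncol = 5, byrow = TRUE,
              dimnames = list(paste0("tweet", 1:6), terms))

# Two clusters with Manhattan distance, as in the post
fit <- pam(dtm, k = 2, metric = "manhattan")

# For each medoid (a real tweet), list the terms it contains
medoidTerms <- apply(fit$medoids, 1,
                     function(row) paste(terms[row > 0], collapse = " "))

for (i in seq_along(medoidTerms)) {
  cat("cluster ", i, ": ", medoidTerms[i], "\n", sep = "")
}
```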


Cluster Plot:

Our results are not fantastic, coming from only 199 tweets, but the two components explain ~37.57% of the variability. You can observe the clusters below; only one is distinct.
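For readers wondering where a figure like "the two components explain X% of the variability" comes from, here is a rough sketch. It assumes the two plot axes are the first two principal components of the data and uses prcomp() from base R as a stand-in; the cluster plot may compute its components somewhat differently. The data matrix is invented.

```r
# Invented data: six observations, three variables
x <- cbind(a = c(1, 2, 3, 4, 5, 6),
           b = c(2, 4, 5, 4, 5, 7),
           c = c(1, 0, 1, 0, 1, 0))

# Principal components and the share of variance each one carries
pc <- prcomp(x)
varShare <- pc$sdev^2 / sum(pc$sdev^2)

# Share of the variability captured by a two-dimensional plot
twoComponentShare <- sum(varShare[1:2])
cat(sprintf("first two components: %.2f%% of the variability\n",
            100 * twoComponentShare))
```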




Silhouette Plot:
A silhouette width close to 1 indicates a good fit within the cluster (points close together), whereas a value close to zero means the point sits on the border between two clusters, and a negative value indicates the point is most likely misclassified. The first cluster has a decent average width of 0.65 for "ukraine", and cluster two has a very high average width of 0.98 for the 9 terms.
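The silhouette width of a point i is s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is its average distance to the other points in its own cluster and b(i) is its average distance to the points of the nearest other cluster. The sketch below computes this by hand on four invented points and compares the result with silhouette() from the cluster package. The helper silWidth() is a hypothetical name used only for this illustration.

```r
library(cluster)

# Invented data: two tight, well-separated groups
x <- c(0, 1, 10, 11)
clusters <- c(1L, 1L, 2L, 2L)
d <- as.matrix(dist(x))

# Silhouette width of point i, computed from the definition
silWidth <- function(i) {
  own <- setdiff(which(clusters == clusters[i]), i)
  a <- mean(d[i, own])
  others <- setdiff(unique(clusters), clusters[i])
  b <- min(sapply(others, function(k) mean(d[i, clusters == k])))
  (b - a) / max(a, b)
}

manual <- sapply(seq_along(x), silWidth)
fromPackage <- as.numeric(silhouette(clusters, dist(x))[, "sil_width"])

print(round(manual, 3))
print(round(fromPackage, 3))
```

All four widths are close to 1 here because each point is far nearer to its own group than to the other one.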




What do these results mean? Judging from the terms in the clusters, "ukraine" appeared by itself in most of the tweets. However, some tweets included more detail about the region, such as "prorussian" forces invading "crimea" or "southeastern" "ukraine", and those tweets tended to be similar to each other. So from the clusters we conclude that tweets with #Ukraine fall into two groups: those that include only "ukraine", and those that include "ukraine" along with other comparable terms describing the crisis.

Stay tuned for more text analysis!


Thanks for reading,

Wayne
@beyondvalence
LinkedIn
