
Friday, January 10, 2014

Text Mining: 5. Hierarchical Clustering for Frequent Terms in R


Hello Readers,



Today we will cluster frequent terms from recent tweets by @TheEconomist, using methods covered in the previous posts in the Text Mining Series. I will first recap the code for retrieving the tweets and converting the list data to a data.frame, then to a text corpus, and then to a term document (TD) matrix. The main focus of this post is clustering the frequent terms from that TD matrix.

The code can be found on my GitHub, here!

Check out Text Mining: 6 for K-Medoids clustering.


The Economist Twitter Page

Start R, and let us begin!


From Tweets to a Term Document Matrix




This quick introduction will retrieve 400 tweets from @TheEconomist and transform the tweet list into a data.frame, then a text corpus, and then a term document matrix. The code is shown so that you can follow along if desired.

First, the code for connecting to the Twitter API and retrieving the tweets with the twitteR and ROAuth packages is shown below. This step was covered in this post about retrieving text from Twitter. **Update: Creating a Twitter OAuth object is more reliable than using getTwitterOAuth(), as covered in the link above.**


Retrieving Tweets
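
For reference, here is a minimal sketch of the retrieval step. The keys are placeholders (substitute your own from the Twitter developer page), and the object name econ.tweets is just my choice for this walkthrough:

library(twitteR)
library(ROAuth)

# create the OAuth credential object (replace the placeholder keys with your own)
cred <- OAuthFactory$new(consumerKey = "YOUR_CONSUMER_KEY",
                         consumerSecret = "YOUR_CONSUMER_SECRET",
                         requestURL = "https://api.twitter.com/oauth/request_token",
                         accessURL = "https://api.twitter.com/oauth/access_token",
                         authURL = "https://api.twitter.com/oauth/authorize")
cred$handshake()            # follow the PIN prompt in the console
registerTwitterOAuth(cred)  # register the credential with twitteR

# retrieve the 400 most recent tweets from @TheEconomist
econ.tweets <- userTimeline("TheEconomist", n = 400)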

Next we shall convert the tweets into a data.frame and then a text corpus using the tm package, which was discussed here.


Data.frame and Text Corpus Transformations
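
The transformations look roughly like this, continuing from the econ.tweets list above. Note that newer versions of tm may require wrapping plain functions such as tolower in content_transformer():

library(tm)

# flatten the list of tweet objects into a data.frame
econ.df <- do.call("rbind", lapply(econ.tweets, as.data.frame))

# build a text corpus from the tweet text
econ.corpus <- Corpus(VectorSource(econ.df$text))

# basic cleaning: lower case; remove punctuation, numbers, URLs, and stopwords
econ.corpus <- tm_map(econ.corpus, tolower)
econ.corpus <- tm_map(econ.corpus, removePunctuation)
econ.corpus <- tm_map(econ.corpus, removeNumbers)
removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)
econ.corpus <- tm_map(econ.corpus, removeURL)
econ.corpus <- tm_map(econ.corpus, removeWords, stopwords("english"))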

After we have the text corpus, we can start stemming the words for frequency counting, covered here. Remember, stemming requires the SnowballC package. After stemming, we can convert the corpus into a term document matrix.


Stemming and Term Document Conversion
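
A sketch of the stemming and conversion, continuing with econ.corpus from the previous step:

library(SnowballC)

# stem each word down to its root for frequency counting
econ.stemmed <- tm_map(econ.corpus, stemDocument)

# convert the stemmed corpus into a term document matrix
econ.tdm <- TermDocumentMatrix(econ.stemmed,
                               control = list(wordLengths = c(1, Inf)))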

Not to leave out any visualizations, we shall include building a word cloud, found here in a previous post. This requires the wordcloud package.


Creating a Word Cloud
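
A minimal word cloud sketch using the term frequencies from econ.tdm; the min.freq cutoff of 3 is an arbitrary choice:

library(wordcloud)

# sum each term's counts across all tweets
m <- as.matrix(econ.tdm)
word.freq <- sort(rowSums(m), decreasing = TRUE)

# random.order = FALSE places the most frequent terms in the center
wordcloud(words = names(word.freq), freq = word.freq,
          min.freq = 3, random.order = FALSE)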

And the result I obtained is shown below:


@TheEconomist Word Cloud

We see that the most frequent terms are "economist", "new", "weeks", "america", "recent", "mandela", and "year", among others.



Hierarchical Clustering



The hierarchical clustering process was introduced in this post. With the tm library loaded, we will work with the econ.tdm term document matrix. 

First we need to eliminate the sparse terms using the removeSparseTerms() function, whose sparse parameter ranges from 0 to 1. This percentage denotes the maximum allowed proportion of empty elements for a term. A sparse parameter of 0.7 means that we keep only those terms which are less than 70% empty. We set the sparsity at 0.95, so terms that are 95% or more empty will be removed. The terms we accept can therefore still be quite empty, just under 95% empty. Then we can coerce the TD matrix into a regular matrix.

Removing Sparse Terms
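
In code this is a single call; econ.tdm2 and econ.m2 are just my names for the results:

# drop terms that are empty in 95% or more of the tweets
econ.tdm2 <- removeSparseTerms(econ.tdm, sparse = 0.95)

# coerce the reduced TD matrix into a regular matrix
econ.m2 <- as.matrix(econ.tdm2)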

Now we compute the distance matrix for the hclust() function.

Hierarchical Clustering
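
Something like the following, where Ward's method is one reasonable linkage choice (recent versions of R spell it "ward.D" instead of the older "ward"):

# distance matrix between the terms (the rows), on scaled counts
dist.matrix <- dist(scale(econ.m2))

# hierarchical clustering with Ward's method
econ.fit <- hclust(dist.matrix, method = "ward.D")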

Naturally, we plot the dendrogram to visualize the cluster tree.



Plotting a Dendrogram
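
Plotting the fit and boxing the 5 clusters might look like this:

# plot the cluster tree
plot(econ.fit, main = "Cluster Dendrogram of @TheEconomist Terms")

# cut the tree into 5 clusters and draw rectangles around them
rect.hclust(econ.fit, k = 5)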

By cutting the dendrogram into 5 clusters, we obtain the plot below. You can cut the dendrogram into a variety of cluster numbers, depending on the vertical distance, which reflects the dissimilarity between the terms.


Dendrogram

We can evaluate the terms in the clusters by using the cutree() function.


Term Groups
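
For example, continuing with the econ.fit object:

# assign each term to one of the 5 clusters
groups <- cutree(econ.fit, k = 5)
groups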

Observe that "economist" and "new" are each in their own clusters. We have many terms in cluster 2: "china", "dailychart", "now", "recent", "todays", "view", "weeks", and "world". That makes sense, because @TheEconomist regularly tweets infographic "dailychart[s]" describing "recent" information about the "world", "todays" news, or hot issues under debate "now".

And there we have it! The other posts used @nbastats, but in this post we transitioned to @TheEconomist, because tweets from @nbastats included many numbers, which were eliminated during the text corpus transformation.

Thanks for reading,


Wayne

@beyondvalence
LinkedIn