Friday, January 10, 2014

Text Mining: 5. Hierarchical Clustering for Frequent Terms in R

Hello Readers,

Today we will cluster terms, reusing methods from the previous posts in the Text Mining Series, to analyze recent tweets from @TheEconomist. I shall briefly post the code for retrieving the tweets and converting the list data to a data.frame, then to a text corpus, and finally to a term document (TD) matrix. The post itself mainly concentrates on clustering frequent terms from the TD matrix.

The code can be found on my GitHub!

Check out Text Mining: 6 for K-Medoids clustering.

The Economist Twitter Page

Start R, and let us get started!

From Tweets to a Term Document Matrix

This quick introduction will retrieve 400 tweets from @TheEconomist and transform the tweet list into a data.frame, text corpus, and then to a term document matrix. The code is shown so that you can follow along if desired.

First, the code for connecting to the Twitter API and retrieving the tweets with the twitteR and ROAuth packages is shown below. This step was covered in this post about retrieving text from Twitter. **Update: Creating a Twitter OAuth Object is more reliable than using getTwitterOAuth(), covered in link above.**

Retrieving Tweets

Next we shall convert the tweets into a data.frame and then a text corpus using the tm package, which was discussed here.

Data.frame and Text Corpus Transformations

After we have the text corpus, we can start stemming the words for frequency counting, covered here. Remember, stemming requires the SnowballC package. After stemming, we can convert the corpus into a term document matrix.

Stemming and Term Document Conversion
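As a rough illustration of the corpus, stemming, and TD matrix steps, here is a minimal sketch that runs on three made-up sentences instead of real tweets. It assumes a recent version of the tm package (where base functions such as tolower are wrapped in content_transformer()) together with SnowballC. The names tweets.df and toy.tdm are hypothetical, and this is not necessarily the exact code behind econ.tdm.

```r
library(tm)         # corpus handling and term document matrices
library(SnowballC)  # stemming backend used by stemDocument()

# Hypothetical stand-in for the tweet data.frame (made-up text, not real tweets)
tweets.df <- data.frame(
  text = c("Daily charts: the world economy this week",
           "China and the world: new charts",
           "A new view of recent weeks in America"),
  stringsAsFactors = FALSE
)

# Build the corpus and apply the usual cleaning transformations
corpus <- VCorpus(VectorSource(tweets.df$text))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stemDocument)  # e.g. "charts" is reduced to "chart"

# Rows are terms, columns are documents
toy.tdm <- TermDocumentMatrix(corpus)
inspect(toy.tdm)
```

With real tweets you would build the corpus from the text column of your own data.frame in the same way.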

Not to leave out any visualizations, we shall include building a word cloud, found here in a previous post. This requires the wordcloud package.

Creating a Word Cloud

And the result I obtained is shown below:

@TheEconomist Word Cloud

We see that the most frequent terms are "economist", "new", "weeks", "america", "recent", "mandela", and "year", among others.

Hierarchical Clustering

The hierarchical clustering process was introduced in this post. With the tm library loaded, we will work with the econ.tdm term document matrix. 

First we need to eliminate the sparse terms using the removeSparseTerms() function, whose sparse argument ranges from 0 to 1. A term's sparsity is the proportion of documents in which it does not appear, that is, the empty elements in its row. A sparse parameter of 0.7 keeps only those terms which are less than 70% empty. We set the sparsity at 0.95, so that terms which are at least 95% empty will be removed. The terms we accept can therefore still be very empty, anything short of 95% empty. Then we can coerce the TD matrix into a regular matrix.

Removing Sparse Terms
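To make the threshold concrete, here is a small base R sketch that computes the sparsity by hand on a made-up matrix. The terms, counts, and the variable names m, kept.95, and kept.70 are all hypothetical. It only mimics the rule as I understand it from the tm documentation (keep terms whose sparsity is below the threshold) and does not call removeSparseTerms() itself.

```r
# Hypothetical toy term document matrix: 4 terms (rows) x 40 documents (columns)
m <- matrix(0, nrow = 4, ncol = 40,
            dimnames = list(c("economist", "new", "china", "rare"), NULL))
m["economist", 1:30] <- 1   # appears in 30 of 40 documents
m["new", 1:16]       <- 1   # appears in 16 of 40 documents
m["china", 1:4]      <- 1   # appears in 4 of 40 documents
m["rare", 1]         <- 1   # appears in 1 of 40 documents

# Sparsity of a term = share of documents in which it does not occur
sparsity <- rowMeans(m == 0)
print(sparsity)

# Keep only the terms whose sparsity is strictly below the threshold
kept.95 <- names(sparsity)[sparsity < 0.95]
kept.70 <- names(sparsity)[sparsity < 0.70]
print(kept.95)
print(kept.70)
```

Here "rare" is 97.5% empty and is dropped at 0.95, while the stricter 0.70 threshold also drops "china", which is 90% empty.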

Now we compute the distance matrix for the hclust() function.

Hierarchical Clustering
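A minimal sketch of this step on a made-up matrix follows. The counts are hypothetical, and both the scaling step and the Ward linkage ("ward.D2", available in R 3.1.0 and later) are assumptions chosen for illustration, so swap in whichever linkage method you prefer.

```r
# Hypothetical term counts: rows are terms, columns are six tweets
m <- rbind(economist  = c(3, 2, 3, 2, 3, 2),
           new        = c(2, 0, 2, 0, 2, 1),
           dailychart = c(0, 1, 0, 1, 0, 1),
           world      = c(0, 1, 1, 1, 0, 1),
           mandela    = c(1, 0, 0, 0, 1, 0))

# Euclidean distances between the (column-scaled) term rows
d <- dist(scale(m))

# Agglomerative clustering of the terms
fit <- hclust(d, method = "ward.D2")
print(fit$merge)   # which terms or clusters were joined at each step
```

Because dist() works on rows, the terms must be the rows of the matrix, which is the layout a term document matrix gives you after as.matrix().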

Naturally, we then plot the cluster tree as a dendrogram.

Plotting a Dendrogram

By cutting the dendrogram into 5 clusters, we obtain the plot below. You can cut the dendrogram into a different number of clusters depending on the height at which you cut. That vertical distance reflects how dissimilar the merged terms are.
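Plotting and cutting can be sketched with base R's plot() and rect.hclust(). The matrix below is made up (eight hypothetical terms over six tweets), and the choice of Ward linkage is an assumption, so the resulting tree will not match the one from the real tweets.

```r
# Hypothetical term counts: rows are terms, columns are six tweets
m <- rbind(economist  = c(3, 2, 3, 2, 3, 2),
           new        = c(2, 0, 2, 0, 2, 1),
           china      = c(0, 1, 0, 1, 0, 0),
           dailychart = c(0, 1, 0, 1, 0, 1),
           recent     = c(0, 1, 1, 1, 0, 0),
           world      = c(0, 1, 0, 1, 1, 1),
           mandela    = c(1, 0, 0, 0, 1, 0),
           year       = c(1, 0, 0, 0, 0, 0))

fit <- hclust(dist(m), method = "ward.D2")

# Draw the dendrogram, then outline the 5 clusters with red boxes
plot(fit, main = "Toy term dendrogram", xlab = "", sub = "")
boxes <- rect.hclust(fit, k = 5, border = "red")
```

rect.hclust() must be called after plot() because it draws on the existing dendrogram. It also invisibly returns the members of each boxed cluster, stored here in boxes.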


We can evaluate the terms in the clusters by using the cutree() function.

Term Groups
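Here is a small sketch of cutree() on a made-up matrix. The five terms, their counts, and the complete linkage are all hypothetical choices for illustration, picked so that the grouping is easy to follow by eye.

```r
# Hypothetical term counts: rows are terms, columns are four tweets
m <- rbind(economist = c(5, 4, 5, 4),
           new       = c(3, 0, 3, 0),
           china     = c(0, 1, 0, 1),
           world     = c(0, 1, 1, 2),
           recent    = c(0, 1, 0, 0))

fit <- hclust(dist(m), method = "complete")

# Assign every term to one of 3 clusters
groups <- cutree(fit, k = 3)
print(groups)

# List the terms that belong to each cluster
print(split(names(groups), groups))
```

In this toy example "economist" and "new" each end up in a cluster of their own, while "china", "world", and "recent" share the third cluster.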

Observe that "economist" and "new" are both in their own clusters. We have many terms in cluster 2: "china", "dailychart", "now", "recent", "todays", "view", "weeks", and "world". That makes sense because @TheEconomist regularly tweets infographic "dailychart[s]" describing "recent" information about the world, or about "todays" news, or hot issues under debate "now".

And there we have it! The other posts used @nbastats, and in this post we transitioned to @TheEconomist because tweets from @nbastats included many numbers, which were eliminated during the text corpus transformation.

Thanks for reading,




  1. So, is hclustering not worth it on data sets like nbastats?

    1. It depends on the content of the tweets. Since @nbastats has numerical statistics (no surprise), many of them are eliminated when converting to a text corpus. For example, a @fashion twitter handle might include many url links to images/sites of clothes etc, with little text content other than hashtags (#fallfashion or something). If you want to look at hashtags, then it is OK.

      So I chose a handle like @TheEconomist because it has more world news content, and therefore more text, than @nbastats.

  2. I would like to know how to validate this hierarchical clustering of a Twitter dataset in R against other clustering algorithms.

  3. I tried the same code, but when I go through the wordcloud and hclust steps, the results I get are numbers, not words. Can you please help me understand why this is happening?
