Valence Analytics: Text Mining: 5. Hierarchical Clustering for Frequent Terms in R

Friday, January 10, 2014

Text Mining: 5. Hierarchical Clustering for Frequent Terms in R

Hello Readers,

Today we will cluster the terms in recent tweets from @TheEconomist, using methods from the previous posts in the Text Mining Series. I will include the code for retrieving the tweets and for converting the list data to a data.frame, then to a text corpus, and then to a term document (TD) matrix. The main focus of this post is clustering frequent terms from the TD matrix.

The code can be found on my GitHub.

Check out Text Mining: 6 for K-Medoids clustering.

The Economist Twitter Page

Start R, and let us get started!

From Tweets to a Term Document Matrix

This quick introduction will retrieve 400 tweets from @TheEconomist and transform the tweet list into a data.frame, text corpus, and then to a term document matrix. The code is shown so that you can follow along if desired.

First, the code for connecting to the Twitter API and retrieving the tweets with the twitteR and ROAuth packages is shown below. This step was covered in the earlier post about retrieving text from Twitter. Update: creating a Twitter OAuth object is more reliable than using getTwitterOAuth(), which was covered in that post.

Retrieving Tweets

Next we shall convert the tweets into a data.frame and then into a text corpus using the tm package, which was discussed in an earlier post.

Data.frame and Text Corpus Transformations
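A minimal sketch of this step, assuming the tweet text is already available as a character vector. The three tweets, the names econ.df and econ.corpus, and the removeURL helper are made up for illustration; they are not the retrieved @TheEconomist data or necessarily the names used in the original code.

```r
library(tm)

# Hypothetical stand-in for the retrieved tweets (not real @TheEconomist data)
tweets <- c(
  "Daily chart: China's economy in 2013 http://example.com/a",
  "This week's cover: America and the world",
  "New this week: the world economy, China and America"
)

# Tweet text as a data.frame, one row per tweet
econ.df <- data.frame(text = tweets, stringsAsFactors = FALSE)

# Build a corpus with one document per tweet
econ.corpus <- VCorpus(VectorSource(econ.df$text))

# Hypothetical helper: strip URLs before punctuation removal mangles them
removeURL <- function(x) gsub("http[^[:space:]]*", "", x)

# Standard cleaning steps; custom functions are wrapped in content_transformer()
econ.corpus <- tm_map(econ.corpus, content_transformer(tolower))
econ.corpus <- tm_map(econ.corpus, content_transformer(removeURL))
econ.corpus <- tm_map(econ.corpus, removePunctuation)
econ.corpus <- tm_map(econ.corpus, removeNumbers)
econ.corpus <- tm_map(econ.corpus, removeWords, stopwords("english"))
econ.corpus <- tm_map(econ.corpus, stripWhitespace)

# Pull the cleaned text back out to check the result
cleaned <- sapply(econ.corpus, as.character)
print(cleaned)
```

Note that removeNumbers drops every digit, which is why number-heavy tweets lose much of their content at this stage.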

After we have the text corpus, we can start stemming the words for frequency counting, as covered in an earlier post. Remember that stemming requires the SnowballC package. After stemming, we can convert the corpus into a term document matrix.

Stemming and Term Document Conversion
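A short sketch of stemming and the conversion to a term document matrix, assuming a corpus that has already been cleaned. The three documents and the names docs, econ.corpus, and term.freq are hypothetical; econ.tdm is the matrix name used later in this post.

```r
library(tm)
library(SnowballC)  # stemDocument() relies on this package

# Hypothetical, already-cleaned tweet text (lowercase, no punctuation)
docs <- c("economist weeks chart",
          "week world economist",
          "new chart weeks")
econ.corpus <- VCorpus(VectorSource(docs))

# Reduce each word to its stem, e.g. "weeks" becomes "week"
econ.corpus <- tm_map(econ.corpus, stemDocument)

# Rows are terms, columns are tweets; keep words of any length
econ.tdm <- TermDocumentMatrix(econ.corpus,
                               control = list(wordLengths = c(1, Inf)))

# Total count of each term across all tweets, most frequent first
term.freq <- sort(rowSums(as.matrix(econ.tdm)), decreasing = TRUE)
print(term.freq)
```

Because "week" and "weeks" share a stem, they are counted as one term in the matrix.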

Not to leave out any visualizations, we shall also build a word cloud, as described in a previous post. This requires the wordcloud package.

Creating a Word Cloud
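A rough sketch of the word cloud call. The frequencies below are invented for illustration only; in practice they would come from the row sums of the term document matrix, e.g. rowSums(as.matrix(econ.tdm)). The min.freq cutoff of 5 is likewise an assumption.

```r
library(wordcloud)

# Made-up term frequencies, for illustration only
term.freq <- c(economist = 40, new = 22, weeks = 15, america = 12,
               recent = 11, mandela = 9, year = 8, china = 5, world = 4)

# Word placement is random, so fix the seed for a repeatable layout
set.seed(2014)

# Draw only terms appearing at least min.freq times, most frequent in the center
wordcloud(words = names(term.freq), freq = term.freq,
          min.freq = 5, random.order = FALSE)
```

With these numbers, "world" falls below the cutoff and is left out of the cloud.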

And the result I obtained is shown below:

@TheEconomist Word Cloud

We see that the most frequent terms are "economist", "new", "weeks", "america", "recent", "mandela", and "year", among others.

Hierarchical Clustering

The hierarchical clustering process was introduced in this post. With the tm library loaded, we will work with the econ.tdm term document matrix.

First we need to eliminate the sparse terms with the removeSparseTerms() function, whose sparse argument lies between 0 and 1. The sparsity of a term is the proportion of empty elements in its row, that is, the share of tweets in which the term does not appear. A sparse parameter of 0.7 keeps only those terms that are less than 70% empty. We set the sparsity at 0.95, so that terms that are at least 95% empty are removed. The terms we keep can therefore still be very empty, as long as they are less than 95% empty. Then we can coerce the TD matrix into a regular matrix.

Removing Sparse Terms
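To make the sparsity threshold concrete, here is a small sketch on four hypothetical tweets. It uses sparse = 0.7 rather than the 0.95 from this post so that the effect is visible on such a tiny matrix; the names docs, econ.tdm2, and econ.m are assumptions.

```r
library(tm)

# Hypothetical cleaned tweets: "economist" is in all four, "china" in two,
# and the remaining terms in one tweet each
docs <- c("economist china world",
          "economist china",
          "economist mandela",
          "economist year")
econ.tdm <- TermDocumentMatrix(VCorpus(VectorSource(docs)))

# sparse = 0.7 keeps terms that are less than 70% empty. With four tweets,
# a term in one tweet is 75% empty (dropped); a term in two is 50% empty (kept)
econ.tdm2 <- removeSparseTerms(econ.tdm, sparse = 0.7)

# Coerce to a regular matrix for the distance computation
econ.m <- as.matrix(econ.tdm2)
print(econ.m)
```

Only "china" and "economist" survive here; with the full set of tweets and sparse = 0.95, many more terms would be kept.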

Now we compute the distance matrix for the hclust() function.

Hierarchical Clustering
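A minimal sketch of the distance and clustering step, using base R only. The count matrix is a hypothetical stand-in for the real one, and both the scaling and the "ward.D2" linkage are assumptions made for this example, not necessarily the settings used for the results in this post.

```r
# Hypothetical term-by-tweet count matrix standing in for the real one
econ.m <- matrix(
  c(3, 2, 3, 2, 0, 0,
    2, 3, 2, 3, 0, 0,
    0, 0, 1, 0, 3, 2,
    0, 0, 0, 1, 2, 3,
    1, 0, 0, 0, 0, 1),
  nrow = 5, byrow = TRUE,
  dimnames = list(c("economist", "new", "china", "world", "mandela"),
                  paste0("tweet", 1:6))
)

# Euclidean distances between terms (rows) after scaling each column
dist.m <- dist(scale(econ.m))

# Agglomerative clustering; "ward.D2" is one linkage choice among several
econ.fit <- hclust(dist.m, method = "ward.D2")
print(econ.fit)
```

Other linkage methods such as "complete" or "average" can be passed to hclust() in the same way and may give a different tree.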

Next we plot the dendrogram of the cluster tree.

Plotting a Dendrogram
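A sketch of plotting the tree and marking 5 clusters on it with rect.hclust(). The counts are simulated with rpois() purely so the example runs on its own, so the resulting tree will not match the one in this post; the linkage method is also an assumption.

```r
# Simulated term-by-tweet counts, for illustration only
set.seed(42)
terms <- c("economist", "new", "china", "dailychart",
           "now", "recent", "todays", "world")
econ.m <- matrix(rpois(length(terms) * 12, lambda = 1),
                 nrow = length(terms), dimnames = list(terms, NULL))

econ.fit <- hclust(dist(econ.m), method = "ward.D2")

# Draw the dendrogram, then box the 5 clusters obtained by cutting the tree
plot(econ.fit, main = "Term clusters", xlab = "", sub = "")
boxes <- rect.hclust(econ.fit, k = 5)
```

rect.hclust() must be called after plot(), since it draws on the existing dendrogram; it also returns the cluster members invisibly.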

By cutting the dendrogram into 5 clusters, we obtain the plot below. You can cut the dendrogram into a different number of clusters, depending on the vertical distance in the tree, which reflects how different the terms are.

Dendrogram

We can evaluate the terms in the clusters by using the cutree() function.

Term Groups
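A sketch of reading off the cluster membership with cutree(). As before, the counts are simulated only so the example is self-contained, so the groups it prints are not the ones discussed in this post.

```r
# Simulated term-by-tweet counts, for illustration only
set.seed(42)
terms <- c("economist", "new", "china", "dailychart",
           "now", "recent", "todays", "world")
econ.m <- matrix(rpois(length(terms) * 12, lambda = 1),
                 nrow = length(terms), dimnames = list(terms, NULL))
econ.fit <- hclust(dist(econ.m), method = "ward.D2")

# Cluster number (1 to 5) for each term
groups <- cutree(econ.fit, k = 5)

# List the terms that fall in each cluster
term.groups <- split(names(groups), groups)
print(term.groups)
```

The cluster numbers are only labels; what matters is which terms end up together.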

Observe that "economist" and "new" are each in their own cluster. Cluster 2 contains many terms: "china", "dailychart", "now", "recent", "todays", "view", "weeks", and "world". That makes sense because @TheEconomist regularly tweets "dailychart" infographics describing "recent" information about the world, "todays" news, or hot issues under debate "now".

And there we have it! The other posts used @nbastats; for this post we switched to @TheEconomist because tweets from @nbastats included many numbers, which were eliminated during the text corpus transformation.

Thanks for reading,

Wayne
@beyondvalence
LinkedIn

Comments:

Cmoney, Wed Aug 06, 05:37:00 PM
So, is hclustering not worth it on data sets like nbastats?

Mathuna, Fri May 15, 05:21:00 AM
I would like to know how to validate this hierarchical clustering on a Twitter dataset in R against other clustering algorithms.

Unknown, Sat Mar 05, 04:36:00 AM
I tried the same code, but when I go through the wordcloud and hclust steps, the result I get is in numbers, not words. Can you please help me understand why this is happening?
