
Friday, January 10, 2014

Text Mining: 5. Hierarchical Clustering for Frequent Terms in R


Hello Readers,



Today we will discuss clustering terms, using methods from the previous posts in the Text Mining Series, to analyze recent tweets from @TheEconomist. I shall post the code for retrieving the tweets and for transforming the list data into a data.frame, then a text corpus, and then a term document (TD) matrix. This post mainly concentrates on clustering frequent terms from the TD matrix.

The code can be found on my GitHub, here!

Check out Text Mining: 6 for K-Medoids clustering.


The Economist Twitter Page

Start R, and let us get started!


From Tweets to a Term Document Matrix




This quick introduction will retrieve 400 tweets from @TheEconomist and transform the tweet list into a data.frame, text corpus, and then to a term document matrix. The code is shown so that you can follow along if desired.

First, the code for connecting to the Twitter API and retrieving the tweets with the twitteR and ROAuth packages is shown below. This step was covered in this post about retrieving text from Twitter. **Update: Creating a Twitter OAuth Object is more reliable than using getTwitterOAuth(), covered in link above.**


Retrieving Tweets
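If you want to type along, a sketch of this step looks something like the following. It assumes you have registered your own Twitter application; the consumer key and secret below are placeholders for your own credentials.

```r
library(twitteR)
library(ROAuth)

# build an OAuth credential object (key and secret are placeholders)
cred <- OAuthFactory$new(consumerKey    = "YOUR_CONSUMER_KEY",
                         consumerSecret = "YOUR_CONSUMER_SECRET",
                         requestURL = "https://api.twitter.com/oauth/request_token",
                         accessURL  = "https://api.twitter.com/oauth/access_token",
                         authURL    = "https://api.twitter.com/oauth/authorize")
cred$handshake()             # visit the printed URL and enter the PIN
registerTwitterOAuth(cred)   # register the credential with twitteR

# retrieve the 400 most recent tweets from @TheEconomist
tweets <- userTimeline("TheEconomist", n = 400)
```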

Next we shall convert the tweets into a data.frame and then a text corpus using the tm package, which was discussed here.


Data.frame and Text Corpus Transformations
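A sketch of these conversions, continuing from the tweets object above (the names tweets.df and econ.corpus, and the particular cleaning steps, are my own choices):

```r
library(tm)

# flatten the list of tweets into a data.frame, one row per tweet
tweets.df <- do.call("rbind", lapply(tweets, as.data.frame))

# build a corpus from the tweet text and apply the usual cleaning
econ.corpus <- Corpus(VectorSource(tweets.df$text))
econ.corpus <- tm_map(econ.corpus, content_transformer(tolower))
econ.corpus <- tm_map(econ.corpus, removePunctuation)
econ.corpus <- tm_map(econ.corpus, removeNumbers)
econ.corpus <- tm_map(econ.corpus, removeWords, stopwords("english"))
econ.corpus <- tm_map(econ.corpus, stripWhitespace)
```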

After we have the text corpus, we can start stemming the words for frequency counting, covered here. Remember, stemming requires the SnowballC package. After stemming, we can convert the corpus into a term document matrix.


Stemming and Term Document Conversion
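A sketch of the stemming and conversion, continuing from the corpus above (the wordLengths setting is my own choice, to keep even very short terms):

```r
library(SnowballC)  # supplies the stemming algorithms used by tm

# stem each document in the corpus
econ.corpus <- tm_map(econ.corpus, stemDocument)

# convert the stemmed corpus into a term document matrix
econ.tdm <- TermDocumentMatrix(econ.corpus,
                               control = list(wordLengths = c(1, Inf)))
```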

Not to leave out any visualizations, we shall include building a word cloud, found here in a previous post. This requires the wordcloud package.


Creating a Word Cloud
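A sketch of building the cloud from econ.tdm; the minimum frequency of 5 and the color palette are my own choices:

```r
library(wordcloud)
library(RColorBrewer)

# term frequencies from the term document matrix
m <- as.matrix(econ.tdm)
word.freq <- sort(rowSums(m), decreasing = TRUE)

# draw the cloud, dropping terms appearing fewer than 5 times
wordcloud(words = names(word.freq), freq = word.freq,
          min.freq = 5, random.order = FALSE,
          colors = brewer.pal(8, "Dark2"))
```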

And the result I obtained is shown below:


@TheEconomist Word Cloud

We see that the most frequent terms are "economist", "new", "weeks", "america", "recent", "mandela", and "year", among others.



Hierarchical Clustering



The hierarchical clustering process was introduced in this post. With the tm library loaded, we will work with the econ.tdm term document matrix. 

First we need to eliminate the sparse terms using the removeSparseTerms() function, whose sparse parameter ranges from 0 to 1. This sparse percentage denotes the proportion of empty elements a term may have: a sparse parameter of 0.7 means we keep only those terms which are less than 70% empty. We set the sparsity at 0.95, so terms with at least that sparse percentage are removed. The terms we accept can therefore still be quite empty, up to 95% empty. Then we can coerce the TD matrix into a regular matrix.

Removing Sparse Terms
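A sketch, assuming the econ.tdm matrix from earlier:

```r
# remove terms that are at least 95% empty across the tweets
econ.tdm2 <- removeSparseTerms(econ.tdm, sparse = 0.95)

# coerce the reduced term document matrix into a regular matrix
m2 <- as.matrix(econ.tdm2)
```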

Now we compute the distance matrix for the hclust() function.

Hierarchical Clustering
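Something along these lines, with Ward linkage as one reasonable choice (other hclust methods work as well):

```r
# distance matrix between the scaled term frequency profiles
distMatrix <- dist(scale(m2))

# hierarchical clustering with Ward's method
# (use method = "ward" on R versions before 3.1.0)
fit <- hclust(distMatrix, method = "ward.D")
```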

Naturally, we plot the dendrogram to see the cluster tree.



Plotting a Dendrogram
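A minimal plotting sketch; the rect.hclust() call outlines the 5 clusters cut in the next step:

```r
# plot the cluster tree of frequent terms
plot(fit, cex = 0.9, hang = -1,
     main = "Cluster Dendrogram of @TheEconomist Terms")

# outline the k = 5 clusters on the dendrogram
rect.hclust(fit, k = 5)
```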

By cutting the dendrogram into 5 clusters, we obtain the plot below. You can cut the dendrogram into a variety of cluster numbers, depending on the vertical distance, which reflects the dissimilarity between the terms.


Dendrogram

We can evaluate the terms in the clusters by using the cutree() function.


Term Groups
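A sketch of cutting the tree and listing the terms in each group:

```r
# assign each term to one of 5 clusters
groups <- cutree(fit, k = 5)

# list the terms belonging to each cluster
split(names(groups), groups)
```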

Observe that "economist" and "new" are each in their own cluster. We have many terms in cluster 2: "china", "dailychart", "now", "recent", "todays", "view", "weeks", and "world". That makes sense, because @TheEconomist regularly tweets infographic "dailychart[s]" describing "recent" information about the world, "todays" news, or hot issues being debated "now".

And there we have it! The other posts used @nbastats, and in this post we transitioned to @TheEconomist because tweets from @nbastats included many numbers, which were eliminated during the text corpus transformation.

Thanks for reading,


Wayne

@beyondvalence
LinkedIn

Wednesday, December 25, 2013

Cluster Analysis: Using K-Means in R



Hello Readers,



Hope you guys are having a wonderful holiday! (I am.) Today in this post we will cover the k-means clustering technique in R.

We will use the familiar iris data set available in R.


Let us get started!






K-Means Clustering



Start by loading the cluster package with library(cluster) in R. The first, and very important, step in k-means clustering is choosing the number of final clusters (k).

Therefore the k-means clustering process begins with an educated 'guess' of the number of clusters. Given k clusters, R selects k observations in the data to serve as the initial cluster centers. Then the Euclidean distance is calculated from each observation to the cluster centers, and each observation is placed in the cluster whose center it is closest to.

Then the center of each cluster is recalculated, and the Euclidean distance is taken between each observation and the new cluster centers. R checks every observation to see if it is closer to another cluster center, and reassigns it if so. This process of recalculating the cluster centers and checking observation distances is repeated until no observations change cluster.

The kmeans() function requires us to specify k, and it then chooses k observations as the initial centers of the clusters.

Before clustering, remove the species column from the iris data set to retain the numerical values only.


Iris2 Data Set Without Species
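A quick sketch of this preparation step:

```r
# copy iris and drop the Species column, keeping only numeric values
iris2 <- iris
iris2$Species <- NULL
head(iris2)
```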
Now use the kmeans() function on iris2 with 3 centers (the number of clusters), as shown below with the output.


K-Means Result
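A sketch of the call; since the initial centers are chosen at random, I set a seed (the seed value is arbitrary) so the run is reproducible:

```r
# k-means clustering of the numeric iris data with k = 3
set.seed(8953)  # arbitrary seed for reproducible random starts
km.result <- kmeans(iris2, centers = 3)
km.result
```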
Note that the output includes the size of each cluster (50, 38, 62), the means of each variable in each cluster, the vector of cluster assignments, the withinss for each cluster, and the available components of the km.result object.

kmeans() seeks to minimize the withinss of each cluster, which is the sum of squared errors (SSE), or scatter. It is calculated by taking the sum of the squared distances between the observations and the centroid of each cluster.

To see how well the k-means clustering performed, we can create a table with the actual species of iris and the cluster numbers:


Species and Cluster Comparison
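A sketch of the comparison:

```r
# cross-tabulate the true species against the assigned cluster numbers
table(iris$Species, km.result$cluster)
```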

As we can see, the clustering successfully clustered all the setosa species, but had difficulty with virginica (14 off) and versicolor (2 off). To quantify the agreement, we can use the flexclust package, as shown below:


Adjusted Rand Index
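A sketch using flexclust's randIndex() on the cross-tabulation:

```r
library(flexclust)

# adjusted Rand index between species labels and cluster assignments
randIndex(table(iris$Species, km.result$cluster))
```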

An agreement of 0.73 is not too bad, as the index ranges from -1 (no agreement) to 1 (perfect agreement). We have 0.73 agreement between the iris species and the cluster solution.



Plotting Clusters



Next we can visualize the clusters we constructed and their centers, with the code below:


Plotting Code
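A sketch of the plot, using the two sepal variables as one reasonable choice of axes:

```r
# plot the sepal measurements, colored by cluster assignment
plot(iris2[c("Sepal.Length", "Sepal.Width")], col = km.result$cluster)

# overlay the three cluster centers as large diamonds
points(km.result$centers[, c("Sepal.Length", "Sepal.Width")],
       col = 1:3, pch = 23, cex = 3, lwd = 2)
```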

And that yields the visual below. Note that the three diamonds are the cluster centers, in black, green, and red.


Iris Clusters Plot
The model looks OK, except for a few red cluster points close to the green center, possibly misclassifications of the virginica and versicolor species, since setosa was categorized completely correctly. As we can see, the clustering is not perfect, hence the agreement score of 0.73.


In the next Cluster Analysis post I will discuss finding a suitable k with which to begin the k-means analysis. In this case, we knew we had 3 species to begin with, so it was easy to plug in the k. However, we will look at the within sum of squares and other criteria to see what k to use for a data set on wine. Stay tuned!


Thanks for reading,


Wayne
@beyondvalence