
Thursday, May 8, 2014

Text Mining: 7. Term Network Analysis Using #Ukraine Tweets in R


Hello Readers,

Today we move to the next phase of text mining: network analysis of terms, or keywords, from Twitter. This post continues to use the #Ukraine tweet data from Text Mining 6: K-Medoids Clustering in the Text Mining Series.


So far we have covered hierarchical clustering and k-medoids clustering to group and partition the frequent words in tweets. With network analysis, not only can we determine which terms appear together frequently, we can also visualize how keywords and tweets are connected as a network of terms. This way, we can see which keywords are connected to one another, and count how many connections a specific keyword has with other keywords.


Resulting Network of Terms

Using network terminology, our keywords are the 'nodes' in a network, also called vertices. Connections are called edges. For an example using car makers, let us say the keyword "generalmotors" has a connection to both "ford" and "toyota". The vertex (singular of vertices) is "generalmotors", which has 2 edges linking it with "ford" and "toyota". Both "ford" and "toyota" are vertices as well, and could have other edges with other keywords, such as "fiat" or "bmw", whether or not those keywords are connected to "generalmotors".
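To make the vocabulary concrete, here is a minimal base R sketch of the car maker example. The edge between "toyota" and "bmw" is a hypothetical addition for illustration, and the names edges, vertices, and car.degree are my own, not part of the tweet analysis.

```r
# Toy edge list for the car maker example.
# The toyota-bmw edge is a hypothetical addition for illustration.
edges <- data.frame(
  from = c("generalmotors", "generalmotors", "toyota"),
  to   = c("ford", "toyota", "bmw"),
  stringsAsFactors = FALSE
)

# every distinct keyword in the edge list is a vertex
vertices <- sort(unique(c(edges$from, edges$to)))

# degree = number of edges touching each vertex
car.degree <- sapply(vertices, function(v) sum(edges$from == v | edges$to == v))
print(car.degree)
```

Here "generalmotors" has degree 2 because two edges touch it, which is the same count igraph reports with degree() later in the post.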

Go ahead and load the igraph package in R, and let us begin.


#Ukraine Tweet Terms as a Network


Since we are using the tweet term data from the previous text mining post, we first convert the term document matrix into an ordinary R matrix. Next we turn it into a Boolean matrix: every nonzero count becomes 1, and zeros stay 0. In other words, each entry simply records yes or no for whether a term exists in a document.

Code:

# Network analysis

# load ukraine tdm data
load("ukraine.tdm.RData")

# create matrix
ukraine.m <- as.matrix(ukraine.tdm)

# to boolean matrix
ukraine.m[ukraine.m>=1] <- 1

# to term adjacency matrix
# %*% is product of 2 matrices
ukraine.m2 <- ukraine.m %*% t(ukraine.m)

The second matrix we create is a term adjacency matrix, which shows how many 'connections' each pair of terms has. We obtain it by multiplying the Boolean matrix by its transpose, using the '%*%' matrix operator. Each entry is the inner product of two term rows, which gives the number of documents in which the two terms appear together. The diagonal holds the number of documents containing each term.
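As a small worked example of the two steps above, the sketch below runs them on a made-up matrix of 3 terms by 4 documents. The counts and the toy.m and toy.m2 names are hypothetical and are not taken from the #Ukraine data.

```r
# Hypothetical 3-term x 4-document count matrix (not the real tweet data)
toy.m <- matrix(c(2, 1, 0, 1,
                  1, 0, 0, 3,
                  0, 1, 1, 0),
                nrow = 3, byrow = TRUE,
                dimnames = list(c("ukraine", "odessa", "crimea"),
                                paste0("doc", 1:4)))

# to boolean matrix: any count of 1 or more becomes 1
toy.m[toy.m >= 1] <- 1

# to term adjacency matrix: terms x terms
toy.m2 <- toy.m %*% t(toy.m)
print(toy.m2)
```

In this toy case "ukraine" and "odessa" share doc1 and doc4, so their entry is 2, while "odessa" and "crimea" never share a document, so their entry is 0. The result is symmetric, which is why the graph is later built as undirected.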


Adjacency Matrix

Now that we have our adjacency matrix, we can build a graph from it with the igraph package, simplify it, set the labels and degrees of the vertices (the terms), and plot the network.

Plot Code:
# build graph with igraph ####
library(igraph)
# build adjacency graph
ukraine.g <- graph.adjacency(ukraine.m2, weighted=TRUE, mode="undirected")
# remove loops
ukraine.g <- simplify(ukraine.g)
# set labels and degrees of vertices
V(ukraine.g)$label <- V(ukraine.g)$name
V(ukraine.g)$degree <- degree(ukraine.g)

# plot layout fruchterman.reingold
layout1 <- layout.fruchterman.reingold(ukraine.g)
plot(ukraine.g, layout=layout1, vertex.size=20, 
     vertex.label.color="darkred")

Remember, the labels are the names of the vertices, and the degrees are the number of connections to other terms. We create a layout using the layout.fruchterman.reingold() function. In the plot() options, I specified a larger vertex size (vertex.size=20) and a dark red label color (vertex.label.color="darkred") so we can see the terms better:


Ukraine Terms Network Graphic

We can observe "ukraine" being the central term holding the network together (it is the term we used to query the tweets). Looking closer at the connections, we see other terms with many connections, such as "southeastern", "prorussian", "odessa", and "observes". With the unrest in the southeast involving pro-Russian forces, including in the major city of Odessa, people are observing "dozens" "dead", and are in "mourning".
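The "many connections" reading above is the vertex degree. As a rough cross-check that does not need igraph, the same quantity can be sketched in base R from an adjacency matrix: dropping self-loops corresponds to zeroing the diagonal, and the degree is the number of nonzero entries left in each row. The matrix values and the toy.adj and toy.degree names below are hypothetical.

```r
# Hypothetical co-occurrence matrix for three terms (illustrative values only)
toy.terms <- c("ukraine", "odessa", "crimea")
toy.adj <- matrix(c(3, 2, 1,
                    2, 2, 0,
                    1, 0, 2),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(toy.terms, toy.terms))

# dropping self-loops corresponds to zeroing the diagonal
diag(toy.adj) <- 0

# degree = number of distinct other terms each term co-occurs with
toy.degree <- rowSums(toy.adj > 0)
print(sort(toy.degree, decreasing = TRUE))
```

Note that degree counts distinct neighbours only. How often two terms co-occur is kept separately as the edge weight.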


The graph is crowded, but we can modify the igraph object's label size, label color, vertex border, and edge color and width to make the terms and connections easier to view. Note that the label size ($label.cex) is scaled by each vertex's degree relative to the maximum degree ($degree), while the edge color ($color) and edge width ($width) are scaled by the log of each edge's weight relative to the maximum (E(ukraine.g)$weight).

Improved Plot Code:
# change label size of vertices (nodes) V()
# change the edge color (connections) E()
V(ukraine.g)$label.cex <- 2.2 * V(ukraine.g)$degree / max(V(ukraine.g)$degree) + .2
V(ukraine.g)$label.color <- rgb(0, 0, 0.2, 0.8)
V(ukraine.g)$frame.color <- NA
egam <- (log(E(ukraine.g)$weight) + .4) / max(log(E(ukraine.g)$weight) + .4)
E(ukraine.g)$color <- rgb(.5, .5, 0, egam)
E(ukraine.g)$width <- egam
# improved plot
plot(ukraine.g, layout=layout1)

We now can see the terms more clearly:


Improved Term Network Graphic

The new label sizes reflect the number of connections (degree) each term has with other terms, and the edge color and width reflect how often two terms occur together. Thus, terms with more connections have larger labels, and pairs that co-occur in more tweets are joined by darker, thicker edges (lines). We see popular vertices (other than "ukraine") in "observes", "east", and "prorussian".
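To see what the scaling does numerically, here is the same arithmetic applied to made-up values. The degrees and weights below are hypothetical, not read from the #Ukraine graph.

```r
# Hypothetical vertex degrees and edge weights (illustrative values only)
toy.degree <- c(ukraine = 10, prorussian = 5, mourning = 1)
toy.weight <- c(1, 2, 10)

# label size: proportional to degree / max degree, plus a floor of 0.2
toy.label.cex <- 2.2 * toy.degree / max(toy.degree) + .2

# edge transparency and width: log weight rescaled so the maximum is 1
toy.egam <- (log(toy.weight) + .4) / max(log(toy.weight) + .4)

print(round(toy.label.cex, 2))
print(round(toy.egam, 2))
```

The best connected term gets a label size of 2.4, and the least connected one stays a little above 0.2, so it remains visible. The + .4 offset keeps an edge with weight 1 (whose log is 0) from becoming fully transparent, and every value stays between 0 and 1 as rgb() requires for its alpha argument.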


In the terms on the lower right, "observes", "dead", "southeastern", "rtcom", "mourning", "days", "dozens", and "crimea" are highly interconnected, indicating that these terms tend to occur together in the same tweet.

There are more inferences you can draw from the #Ukraine term network visual, such as connections of terms to "prorussian". We see that it is heavily connected to "russia", "east", "forces", "foxnews", and "offensive".

On the right-hand side, we see the city of "odessa". There has been heavy military activity and rioting, with pro-Russian activists being targeted in the chaos. Russia has equated the attacks with Russian ethnic genocide, which is why "nazis" appeared in the terms connected to "odessa".

Next we will move to network analysis on complete tweets, not just terms in tweets. So stay tuned for more #rstats!


Thanks for reading,

Wayne
@beyondvalence
LinkedIn
