
Friday, July 18, 2014

#PublicHealth on Twitter in R


Hello Readers,

In this post we examine a specific hashtag on Twitter: #PublicHealth. When I am not studying analytics, I concentrate on biostatistics and public health, the fields that introduced me to data science.

Analytics plays an important role in public health: evidence-based decisions rely on proper data gathering and analysis for their scientific rigor, and data is the power to understand, especially in our age of technology and communication. For example, Google tapped into search queries related to flu symptoms and managed to model and predict flu trends, which I covered in two earlier posts. Using Twitter, we can likewise see snippets of public health discussion in what people are tweeting.

Because Twitter is a real-time micro-blogging site, we can take a snapshot of tweets for a specific user, topic, or time frame. Here I chose the topic #PublicHealth, in light of the Ebola epidemic spreading in West Africa. This particular outbreak, in Sierra Leone, Guinea, and Liberia, is one of the worst in history, with 982 cases and 613 deaths (as of July 17) since March 2014. The Centers for Disease Control and Prevention (CDC) has sent specialists to track, contain, and inform the locals about the Ebola outbreak.


Ebola Map, by CDC

Returning to Twitter and #PublicHealth: from the 300 tweets I queried, after text transformation in R, I created a word cloud of the words in the tweets:



#PublicHealth Word Cloud

The code for querying Twitter, transforming the text, and creating the word cloud can be found in the Appendix below. Investigating the word cloud, aside from "publichealth" we see terms such as "health", "heatwave", "elderly", "prevent", and "vulnerable" (the full list of frequent terms appears in the Appendix output).



#PublicHealth covers a wide range of subtopics and links to many other areas as well. With increased globalization and aging populations, both communicable and non-communicable diseases are rising across the world. Here is the World Health Organization (WHO) on immunization and reducing preventable deaths.

For those interested in studying or learning more about public health, consider the Johns Hopkins Bloomberg School of Public Health and the Johns Hopkins Public Health Studies undergraduate program.


Thanks for reading,

Wayne
@beyondvalence
LinkedIn


Code Appendix:

Text Transformation Code:
> # load library
> library(tm)
> #
> # transforming function (by Wayne)
> # where x is text and w.stop is stopword vector, 
> # w.keep are words to remove from stopwords
> # Term-Doc default, if false, then Doc-Term
> transform.text <- function(x, w.keep=c(""), w.stop=c(""), TD=TRUE) { 
+   
+   cat("Data has ", length(x), " documents.\n")
+   
+   cat("Beginning text transformation: \n\n")
+   
+   cat("Transforming to Corpus... (1/6)\n")
+   text <- x
+   text.corpus <- Corpus(VectorSource(text))
+   
+   cat("Formatting Corpus... (2/6)\n")
+   # lower case
+   text.corpus <- tm_map(text.corpus, tolower)
+   # remove punctuation
+   text.corpus <- tm_map(text.corpus, removePunctuation)
+   # remove numbers
+   text.corpus <- tm_map(text.corpus, removeNumbers)
+   # remove URLs
+   removeURLs <- function(x) gsub("http[[:alnum:]]*", "", x)
+   text.corpus <- tm_map(text.corpus, removeURLs)
+   # add stopwords w
+   myStopWords <- c(stopwords('english'), w.stop)
+   # remove vector w from stopwords
+   myStopWords <- setdiff(myStopWords, w.keep)
+   # remove stopwords from corpus
+   text.corpus <- tm_map(text.corpus, removeWords, myStopWords)
+   
+   cat("Stemming Words... (3/6)\n")
+   # keep corpus copy for use as dictionary
+   text.corpus.copy <- text.corpus
+   # stem words ####
+   text.corpus <- tm_map(text.corpus, stemDocument)
+   
+   cat("Completing Stems... (4/6)\n")
+   # stem completion ####
+   text.corpus <- tm_map(text.corpus, stemCompletion, 
+                           dictionary=text.corpus.copy)
+   
+   if(TD==TRUE) {
+     cat("Creating T-D Matrix... (5/6)\n")
+     text.TDM <- TermDocumentMatrix(text.corpus,
+                                    control=list(wordLengths=c(1,Inf)))
+     cat("Text Transformed! (6/6)\n\n")
+     return(text.TDM)
+   } else {
+       cat("Creating D-T Matrix... (5/6)\n")
+       # create Doc-Term ####
+       text.DTM <- DocumentTermMatrix(text.corpus, 
+                                      control=list(wordLengths=c(1, Inf)))
+       cat("Text Transformed! (6/6)\n\n")
+       return(text.DTM)
+   } 
+   
+ }


Retrieving #PublicHealth Tweets Code:
> library(twitteR)
> library(tm)
> 
> # load twitter cred ####
> load("cred.Rdata")
> registerTwitterOAuth(cred)
[1] TRUE
> 
> # configure RCurl options
> RCurlOptions <- list(capath=system.file("CurlSSL", "cacert.pem", package = "RCurl"),
+ ssl.verifypeer = FALSE)
> options(RCurlOptions = RCurlOptions)
> 
> # query twitter for #PublicHealth in tweets, n=300 ####
> pH <- searchTwitter("#PublicHealth", n=300, lang="en", 
+                     cainfo=system.file("cacert.pem"))
> save(pH, file="publicHealth.rdata")
> 
> # to data.frame ####
> pH.df <- do.call("rbind", lapply(pH, as.data.frame))
> # use textTransformation function ####
> pH.tdm <- transform.text(pH.df$text, w.stop = c("amp", "rt") ,TD = TRUE)
Data has  300  documents.
Beginning text transformation: 

Transforming to Corpus... (1/6)
Formatting Corpus... (2/6)
Stemming Words... (3/6)
Completing Stems... (4/6)
Creating T-D Matrix... (5/6)
Text Transformed! (6/6)

> # find terms with n > 20
> findFreqTerms(pH.tdm, lowfreq=20)
 [1] "advice"          "blogs"           "can"             "climate"        
 [5] "cost"            "croakeyblog"     "data"            "day"            
 [9] "elderly"         "eye"             "falls"           "health"         
[13] "heat"            "heatwave"        "helping"         "herts"          
[17] "india"           "issue"           "jimmcmanusph"    "keep"           
[21] "major"           "need"            "neighbours"      "pheuk"          
[25] "please"          "prevent"         "profkevinfenton" "publichealth"   
[29] "stories"         "support"         "today"           "vulnerable"     
> 


Creating the Word Cloud Code:
> # generate word cloud ####
> library(wordcloud)
Loading required package: Rcpp
Loading required package: RColorBrewer

> pH.matrix <- as.matrix(pH.tdm)
> wordFreq.sort <- sort(rowSums(pH.matrix), decreasing=T)
> # wcloud
> set.seed(1234)
> grayLevels <- gray( (wordFreq.sort + 10) / (max(wordFreq.sort) + 10))
> wordcloud(words=names(wordFreq.sort), freq=wordFreq.sort, 
+           min.freq=3, random.order=F, colors=grayLevels)
> 

Thursday, March 6, 2014

Sochi #Olympics and #Crimea Tweets in R (and Justin Bieber?!)


Hello Readers,


With the 2014 Olympics in Sochi over, we can reflect upon the social media coverage of the Olympic Games. What were Twitter users tweeting about with #Olympics during the games? Read on to see how Justin Bieber pops up (I was confused/surprised too).

In other news across the pond, unless you have been stuck in a cave recently, you might have heard about Russian forces 'occupying' the southern Ukrainian peninsula of Crimea. Here is a series of informative satellite images of Crimea and the disposition of military forces around Ukraine. We will see what Twitter users are saying about #Crimea.

The twitteR package in R has many useful features, which we will use to query tweets from the newly modified Twitter API. Be sure to double-check your unique consumer and secret keys, though they should not have changed. Also verify that the URLs used in the OAuth generation begin with https://, not just http://.

Let us see what Twitter users have been tweeting about the Olympics and Crimea. Onward with text analysis!


Querying Tweets



In previous posts I covered retrieving tweets from the Twitter API and transforming them into documents to be analyzed. So here I will simply show the R code, beginning with establishing a handshake with the Twitter API (use https:// with the recent changes).



Setting Up OAuth

While querying tweets I found that the number returned from the API varied from 25 to 199, even with n set at 300. A workaround is to query multiple times and join the resulting tweets together, as sketched below. I had no such problem with the Olympics tweets, however; they were queried weeks earlier, during the games.


Give Me More Tweets!
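
A minimal sketch of that workaround (object names are assumptions, not the post's exact code): query twice, pool the results into one data.frame, and drop duplicate tweet IDs.

# query twice and pool the results (sketch; names assumed)
batch1 <- searchTwitter("#Crimea", n=300)
batch2 <- searchTwitter("#Crimea", n=300)
# convert to one data.frame and drop tweets returned by both queries
crimea.df <- do.call("rbind", lapply(c(batch1, batch2), as.data.frame))
crimea.df <- crimea.df[!duplicated(crimea.df$id), ]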
For the Olympics tweets the code would simply be:

olympics <- searchTwitter("#Olympics", n=300)


Text Transformations



Next we convert the lists to data.frames, and then to a text corpus. Once the text is in a corpus, we can transform it so we can query the words effectively. The crimea.c transformations are similar to those for olympics.c, so they are omitted.


Raw Lists to Transformed Text Corpus
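
A minimal sketch of those steps for the Olympics tweets (the crimea.c version is identical apart from names):

library(tm)
# status list -> data.frame -> corpus with one document per tweet
olympics.df <- do.call("rbind", lapply(olympics, as.data.frame))
olympics.c <- Corpus(VectorSource(olympics.df$text))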
Now we are ready to normalize the text so that we can count term frequencies, and turn the documents into term-document matrices.


Word Stemming and Term Doc Matrix
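
A sketch of the normalization, stemming, and matrix-building steps, mirroring the transform.text() function shown in the #PublicHealth appendix above:

# normalize, stem, and build the term-document matrix (sketch)
olympics.c <- tm_map(olympics.c, tolower)
olympics.c <- tm_map(olympics.c, removePunctuation)
olympics.c <- tm_map(olympics.c, removeNumbers)
olympics.copy <- olympics.c                 # dictionary for stem completion
olympics.c <- tm_map(olympics.c, stemDocument)
olympics.c <- tm_map(olympics.c, stemCompletion, dictionary=olympics.copy)
olympics.tdm <- TermDocumentMatrix(olympics.c,
                                   control=list(wordLengths=c(1, Inf)))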

From the term-document matrix, we can calculate the frequency of terms across all tweets. For example, from crimea.tdm, we can print the words occurring more than 20 times:


High Frequency Words in Crimea Tweets
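
In tm this is a single call; a sketch with the threshold taken from the text above:

# terms with frequency of at least 20 across the tweets
findFreqTerms(crimea.tdm, lowfreq=20)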

The next step would be to visualize these frequencies in a word cloud. #Olympics is up first.


#Olympics



In a previous post, we created a word cloud visualizing words from @nbastats. Here we shall do the same for the trend #Olympics.



Creating a Word Cloud

And here it is!


#Olympics Word Cloud

We see olympics, gold, sochi, and stranger terms such as jeremybieber (what!?). Who is Jeremy Bieber? An athlete? I had no idea, so I searched Google, and he is definitely not an athlete: he is the father of 'notorious' pop star Justin Bieber. Some celebrity drama was unfolding during the Olympic Games and people flew to Twitter to comment. But it was strange (to me anyway) that music celebrities would be mentioned in tweets tagged with #Olympics.


Upon more digging with Google, I found the likely reasons: ice hockey and Canada. The Biebers are originally from Canada, and many Justin Bieber 'haters' want him to go back. With the American and Canadian hockey teams facing off in the Olympic semi-finals, this billboard popped up in Chicago:


Loser Keeps Bieber
So there was the cause of all the hullabaloo on Twitter. Unfortunately for the 'haters', Justin stays, because the USA men's hockey team lost to Canada 1-0 in the semi-finals. I could not make this up.



#Crimea



For Crimea (Bieber is not Ukrainian too, is he?), the word cloud code is the same, except with crimea.tdm in place of olympics.tdm. And here it is:


#Crimea Word Cloud

We see many terms associated with Crimea, such as ukraine, russia, putin, referendum, and kiev, as well as some Ukrainian spellings, like kyiv (Kiev, the Ukrainian capital). The majority of people in Crimea are Russian, while many in western Ukraine are Ukrainian and want to join the EU. Crimea previously held a referendum on joining Russia in 2004, and another vote came today, March 6th, 2014: the local lawmakers in Crimea voted unanimously to join Russia, with a regional referendum to follow in 10 days. For a video on the history of Crimea, Ukraine, and Russia, click here.

I thought the terms in #Crimea were more logical and politically relevant than terms in #Olympics, although it was amusing to see Justin and his dad mentioned.

___________________________

Hopefully this post shows you how Twitter keywords or trends can be analyzed and visualized, especially where current events are concerned. Tweets are near real-time text data on what people are thinking about, and they are easy to analyze in R. Stay tuned for more R posts!



As always, thanks for reading,


Wayne
@beyondvalence

Friday, January 3, 2014

Text Mining: 2. Converting Tweet Text List into a Document Corpus with Transformations Using R


Welcome Back Readers,



Last time in the Text Mining Series we discussed how to retrieve tweets from the Twitter API with the twitteR package in R. Now that we have the text data to work with, we can transform the text from its raw format into a corpus, a collection of text documents.

This post continues from where we left off in Text Mining 1. Read Text Mining 3 here. We require two packages: tm for text mining and SnowballC for stemming, which collapses words to their roots.

Now is a good time to download and load them in R- so let us get started! 



Data.Frame to Corpus



As we saw last time, the raw text in the tweet list object was far from polished and would take some formatting to clean. After loading the packages and setting the working directory, we start by converting the tweet list object into a data.frame:


Tweet List Object into Data.Frame Object

As we can see, the tweets data.frame has 186 rows and 16 columns. The descriptions of each column are shown below:



Tweet Data.Frame Column Structure

We have columns for the tweet text, whether the tweet was favorited by the user (my account, since my app accessed it), how many favorites it has, the tweet ID, and so on. Note at the bottom the yes/no columns for longitude and latitude coordinates, so we can track the location from which a tweet was sent.

Now that the data exists as a data.frame, we can convert it into a corpus using the Corpus() function from the tm package (the link opens a pdf of the tm package documentation).


Tweets into Corpus
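
A minimal sketch of both conversions (object names assumed; nba.tweets is the tweet list retrieved in Text Mining 1):

library(tm)
# tweet list -> data.frame (186 rows x 16 columns)
tweets.df <- do.call("rbind", lapply(nba.tweets, as.data.frame))
# text column -> corpus with one document per tweet
tweets.corpus <- Corpus(VectorSource(tweets.df$text))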

Observe that the corpus has 186 text documents: the conversion turned each tweet into its own text document.


Next we will transform the tweet corpus. The getTransformations() function displays the transformations available to tm_map() in the text mining package; we have several at our disposal, such as removing punctuation, numbers, words, and white space.



Transforming Text



First we will perform the following modifications using the tm_map() function: lowering character case, removing punctuation, removing numbers, and removing URLs. These are completed using the respective commands tolower, removePunctuation, and removeNumbers, plus a regular-expression substitution built with gsub().


Removing Case, Punctuation, Numbers and URLs

The gsub() function lets us replace the regular expression matching a URL, "http[[:alnum:]]*", with "", the empty string, thereby removing it. We then pass that function to tm_map(), as sketched below.
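
A sketch of these four transformations together (corpus name carried over from the sketch above):

tweets.corpus <- tm_map(tweets.corpus, tolower)
tweets.corpus <- tm_map(tweets.corpus, removePunctuation)
tweets.corpus <- tm_map(tweets.corpus, removeNumbers)
# custom transformation: replace URLs with the empty string
removeURLs <- function(x) gsub("http[[:alnum:]]*", "", x)
tweets.corpus <- tm_map(tweets.corpus, removeURLs)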


Adding and Removing Stopwords

Next we handle stopwords: words that are ignored in searches because they carry little topical content, such as a, the, for, from, who, about, what, when, and where, depending on the analysis. Removing them concentrates the 'content' of each tweet and improves the efficiency of the analysis. Common stopwords, including some that Google Search filters out, can be found here.
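
A sketch of customizing the stopword list; the added and kept words here are illustrative assumptions, not the post's exact choices:

# extend the built-in English stopwords, then protect words we want to keep
myStopwords <- c(stopwords("english"), "available", "via")   # assumed additions
myStopwords <- setdiff(myStopwords, c("r", "big"))           # assumed keep-words
tweets.corpus <- tm_map(tweets.corpus, removeWords, myStopwords)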


We can observe the effect of the transformations by looking at a random tweet, say tweet #4 below. Note how all the punctuation was removed, along with the numbers and newline escapes "\n", and that all characters are lowercase.


Tweet 4 Before and After

Next time we shall discuss stemming the documents in the corpus we created and transformed in this post. Stemming involves cutting words down to their radicals, so we can count a word's frequency even when it is spelled differently or used in a different tense. For example, updated, updating, and update are all 'stemmed' to updat.
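
The SnowballC package exposes the stemmer directly, so we can preview the example above:

library(SnowballC)
wordStem(c("updated", "updating", "update"))
# [1] "updat" "updat" "updat"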

And that is for next time! So stay tuned!


Thanks for reading,



Wayne
@beyondvalence
LinkedIn

Wednesday, January 1, 2014

Text Mining: 1. Retrieving Text from Twitter in R Using the twitteR Package *Updated*

## Updated 2014/3/20 ##
# Twitter API Changed #

Hello Readers,


Happy 2014

As a micro-blogging site, Twitter has a wealth of text data to analyze. However, the text is not initially in a usable form: it requires extraction from the Twitter API and transformation into a document matrix before we can compute word associations or counts. Finally, we can cluster words and tweets into groups to discover how they are composed. We will cover these topics one by one in the Text Mining Series. Read Text Mining 2 here.


A few posts ago we discussed accessing the Twitter API to mine trending data in Python. Today we will discuss Twitter text retrieval in R. Note that readers will require a Twitter account to access the Twitter API.

Let us begin!



Accessing Twitter



To mine Twitter data, we first need to retrieve the text from the Twitter API. The twitteR package will allow us to access Twitter feeds. A large portion of Twitter's text data comes in the form of tweets; we will analyze the last 200 tweets from the @nbastats handle, which features daily basketball statistics. Here is what the Twitter page looks like:


@nbastats Page

The tweets by users are located in the bottom right-hand corner of the page. To access the tweets from other applications, we need a user account to create an application on the developer's page. This way we obtain the proper credentials to access the Twitter API through the twitteR package in R.


In the application page, we will have the OAuth keys which are unique to the application. They will allow applications to authenticate with the Twitter API and access the tweets. Keep the keys secret!


OAuth Keys

Now that we have the keys, start R and load the twitteR package (library(twitteR)). Configure the RCurl options as shown below. The cacert.pem certificate file is required and can be downloaded here into the local working directory for R.


RCurl Options

Then assign the consumer key and consumer secret key to those variables. 
SEE UPDATE BELOW- TWITTER API CHANGED.


Keys and the getTwitterOAuth() Function

In the getTwitterOAuth() function, provide the consumer key and the consumer secret key. It will direct us to the Twitter API website with a link, and once we click Accept, we are given a verification code.
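
A sketch of the original call, with placeholder keys (this helper is superseded by the API update below):

consumerKey    <- "YOUR_CONSUMER_KEY"      # placeholder
consumerSecret <- "YOUR_CONSUMER_SECRET"   # placeholder
getTwitterOAuth(consumerKey, consumerSecret)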

And once we provide the PIN code, we are given the Access tokens and a verification of the completed handshake.


Handshake Completed

Now that we are verified and connected, let us start querying tweets!



API Access Update:


Instead of using the getTwitterOAuth() function, we need to set up an OAuth object manually. Make sure the internet addresses begin with https://, and that you have loaded the twitteR, ROAuth, and XML packages. The RCurlOptions need to be set as well.



After you initiate the handshake() function, you will be prompted to enter a code from the dev.twitter.com page into R. To verify the completion of the handshake, simply call cred, then register it with registerTwitterOAuth(), as sketched below.
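
A sketch of the manual setup (placeholder keys; the https:// addresses are the standard Twitter OAuth endpoints):

library(twitteR)
library(ROAuth)
library(XML)
cred <- OAuthFactory$new(consumerKey="YOUR_CONSUMER_KEY",
                         consumerSecret="YOUR_CONSUMER_SECRET",
                         requestURL="https://api.twitter.com/oauth/request_token",
                         accessURL="https://api.twitter.com/oauth/access_token",
                         authURL="https://api.twitter.com/oauth/authorize")
# opens a URL; enter the PIN from dev.twitter.com when prompted
cred$handshake(cainfo=system.file("CurlSSL", "cacert.pem", package="RCurl"))
cred                            # verify the completed handshake
registerTwitterOAuth(cred)
save(cred, file="cred.Rdata")   # reuse later, as in the #PublicHealth post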

Retrieving Tweets



One thing to note: the cacert.pem certificate must be specified for every tweet query. So when we use the userTimeline() function, we need to assign the certificate to the cainfo argument, in addition to the user handle and number of tweets. Again, we will use @nbastats and try to retrieve up to 200 tweets from that timeline. Here is what the tweets look like at the beginning of the new year:


Tweets from @nbastats

We can save the nba.tweets object to preserve the data and work with it later. Each tweet prints as a single line, and 200 of them are burdensome to read at once, but we can access specific tweets using brackets "[ ]". Let us access tweets 20 to 25, as sketched below.



Selected Tweets Numbers 20 to 25
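
A sketch of the query and the bracket selection (the certificate is assumed to sit in the working directory):

# the certificate must accompany every query
nba.tweets <- userTimeline("nbastats", n=200, cainfo="cacert.pem")
save(nba.tweets, file="nba.tweets.Rdata")   # preserve for later sessions
nba.tweets[20:25]                           # select tweets 20 through 25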



Formatting Tweets



We can use text wrapping to make the output more readable. Create a function that takes arguments for the tweet object name, the lower tweet number, and the higher tweet number. My version looks something like this:


Tweet Printing Function

Observe that there are 3 arguments and a for loop in the function. I start each tweet's output with double brackets indicating the tweet number. We saw before that even though the tweets printed were numbers 20 to 25, the brackets still ran from 1 to 6, in the order they were printed; we would like the actual tweet number to be shown.

The strwrap() function wraps a string by breaking lines at word boundaries; here the width was set to 73 columns. The output for tweets 20 to 25 from nba.tweets is shown below; note how the actual tweet numbers are printed in brackets and the nbastats prefix no longer precedes each tweet. A sketch of such a function follows the output.


Printed Tweets from @nbastats, Numbers 20 to 25
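
A hypothetical reconstruction of that function (the name and details are assumptions; twitteR status objects expose their text via getText(), which carries no screen-name prefix):

printTweets <- function(tweets, from, to) {
  for (i in from:to) {
    cat("[[", i, "]]\n", sep="")    # show the actual tweet number
    # wrap each tweet's text at 73 columns
    cat(strwrap(tweets[[i]]$getText(), width=73), sep="\n")
    cat("\n")
  }
}
printTweets(nba.tweets, 20, 25)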

Now we can explore other Twitter users and what they tweet. Next we will transform the text and create a corpus to build a document-matrix. So stay tuned!

Thanks for reading and have a great New Year,

Wayne

@beyondvalence
LinkedIn