
Thursday, December 12, 2013

Visualizing Twitter Tokens: Hashtags, Smileys and URLs in R


Hello Readers!

Tweet, Tweet!
Welcome back to the blog. Today we will discuss how to visualize trends in Twitter tokens in tweets, specifically hashtags, smileys, and URLs.


The data was obtained from the infochimps site, where they also host data sets for other platforms. For analysis I will be using RStudio.

Let us get started.


The Loading


Unpacked Zip File Content
Once we have downloaded and unzipped the 'tokens by month' data set, we can go ahead and read the tsv (tab-separated values) file into R. Use the str() function to get an idea of the data structure.
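As a rough sketch of this step, the snippet below reads a tab-separated file with read.delim() and inspects it with str(). The file is a tiny stand-in written to a temporary location, and the column names token, time, and count are my own labels for illustration, not names taken from the data set.

```r
# Write a tiny stand-in file; the real file comes from the unzipped download.
sample_path <- tempfile(fileext = ".tsv")
writeLines(c(
  "hashtag\t2009010100\t12",
  "smiley\t2009010100\t30",
  "tweet_url\t2009010100\t45",
  "hashtag\t2009010101\t9"
), sample_path)

# Read the tab-separated values. header = FALSE because this sample
# has no header row; col.names are hypothetical labels.
tokens_raw <- read.delim(sample_path, header = FALSE,
                         col.names = c("token", "time", "count"),
                         stringsAsFactors = FALSE)

# Inspect the structure: number of rows, column types, first values.
str(tokens_raw)
```

For the real data, replace sample_path with the path to the unzipped tsv file.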


Reading in Total Tokens by Hour tsv
Looking at the structure, we see that there are 67,992 observations in 3 columns. The first column holds the token type: "tweet_url", "smiley", or "hashtag". The X column holds the timestamp as a 4-digit year, month, day, and hour in 24-hour format, and the X1 column holds the count. Use the table() function to determine how many measures we have of each token, and also to check for spelling errors.
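A minimal sketch of the table() check, using a small made-up data frame in place of the real data (the column names token, time, and count are assumptions for illustration):

```r
# Made-up stand-in for the loaded data set.
tokens_raw <- data.frame(
  token = c("hashtag", "smiley", "tweet_url", "hashtag", "smiley"),
  time  = c(2009010100, 2009010100, 2009010100, 2009010101, 2009010101),
  count = c(12, 30, 45, 9, 27),
  stringsAsFactors = FALSE
)

# Count how many rows (measures) exist for each token type.
# A misspelled token name would show up as an extra category here.
token_measures <- table(tokens_raw$token)
print(token_measures)
```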


Token Measures
We see there are 18,401 hashtag measures, 25,137 smiley measures, and 24,454 tweet_url measures. Keep in mind these are not counts; they are just the number of times each token was measured in the data set. The actual counts are in column 3, X1. To obtain the total count for each token, we use tapply() to apply a function by index. In the totals, tweet_urls come out on top at 167,819,007! It seems that people are tweeting more internet links than hashtags and smileys put together.
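To illustrate how tapply() sums the counts by token type, here is a small sketch on made-up numbers (the data frame and its column names are assumptions, not the real data):

```r
# Made-up stand-in for the loaded data set.
tokens_raw <- data.frame(
  token = c("hashtag", "smiley", "tweet_url", "hashtag", "smiley"),
  count = c(12, 30, 45, 9, 27),
  stringsAsFactors = FALSE
)

# Apply sum() to the count column, grouped by the token column.
token_totals <- tapply(tokens_raw$count, tokens_raw$token, sum)
print(token_totals)
```

The result is a named vector with one total per token type.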


The tokens() Function


Next we write the tokens() function. Keep in mind the variables we have to track when we separate all the information pertaining to the three different tokens. At the same time, I want to convert the dates into a usable Date.Time format. Putting it together, the result will be a list containing the date and count of the hashtag, smiley, and tweet_url tokens.

The first part of the function is shown below.


Token Function Part 1
We start the function by initializing the variables we need to keep track of. Inside the for loop, the counters h, s, and u advance through each successive record of their respective time and count variables. This pulls the pertinent data for each type of token into its own variables, ready to put in a list.
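The loop described above might look roughly like the sketch below. It is not the original code: the function name split_tokens, the input column names, and the variable names other than count.h are assumptions made for illustration.

```r
# Hypothetical helper: separate the rows by token type using one counter each.
split_tokens <- function(data) {
  # Pre-allocate one time vector and one count vector per token type.
  time.h <- numeric(sum(data$token == "hashtag"))
  count.h <- numeric(length(time.h))
  time.s <- numeric(sum(data$token == "smiley"))
  count.s <- numeric(length(time.s))
  time.u <- numeric(sum(data$token == "tweet_url"))
  count.u <- numeric(length(time.u))

  # Counters mark the next free slot for each token type.
  h <- 1; s <- 1; u <- 1
  for (i in seq_len(nrow(data))) {
    if (data$token[i] == "hashtag") {
      time.h[h] <- data$time[i]; count.h[h] <- data$count[i]; h <- h + 1
    } else if (data$token[i] == "smiley") {
      time.s[s] <- data$time[i]; count.s[s] <- data$count[i]; s <- s + 1
    } else if (data$token[i] == "tweet_url") {
      time.u[u] <- data$time[i]; count.u[u] <- data$count[i]; u <- u + 1
    }
  }
  list(time.h = time.h, count.h = count.h,
       time.s = time.s, count.s = count.s,
       time.u = time.u, count.u = count.u)
}

# Made-up sample rows to exercise the function.
sample_data <- data.frame(
  token = c("hashtag", "smiley", "tweet_url", "hashtag"),
  time  = c(2009010100, 2009010100, 2009010100, 2009010101),
  count = c(12, 30, 45, 9),
  stringsAsFactors = FALSE
)
parts <- split_tokens(sample_data)
```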

Next, we take the date variables and convert them into Date.Time format using the strptime() function, shown below. Afterwards, we create the list and instruct the function to return the tokens list. Now run the tokens() function.
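As a small illustration of the strptime() conversion, the sketch below parses values laid out as year, month, day, and hour. The sample values are made up, and the UTC time zone is an assumption, since the time zone of the data is not stated.

```r
# Made-up timestamps in the form YYYYMMDDHH.
raw_time <- c(2009010100, 2009010113, 2009073123)

# strptime() expects character input, so convert the numbers first.
# %Y = 4-digit year, %m = month, %d = day, %H = hour (00-23).
date_time <- strptime(as.character(raw_time), format = "%Y%m%d%H", tz = "UTC")

print(date_time)
```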


Token Function Part 2
Finally, use the completed tokens() function to create the t.tokens list.
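For comparison, a more compact variant of tokens() is sketched here. It swaps the counter loop for logical indexing, so it is an alternative rather than the function described above. The input column names, the list element names other than count.h, and the UTC time zone are all assumptions.

```r
# Alternative sketch: select rows by token type instead of looping.
tokens <- function(data) {
  # Hypothetical inner helper: pull time and count for one token type.
  pick <- function(type) {
    rows <- data[data$token == type, ]
    list(time = strptime(as.character(rows$time), "%Y%m%d%H", tz = "UTC"),
         count = rows$count)
  }
  h <- pick("hashtag")
  s <- pick("smiley")
  u <- pick("tweet_url")
  list(time.h = h$time, count.h = h$count,
       time.s = s$time, count.s = s$count,
       time.u = u$time, count.u = u$count)
}

# Made-up sample rows standing in for the real data.
sample_data <- data.frame(
  token = c("hashtag", "smiley", "tweet_url", "hashtag"),
  time  = c(2009010100, 2009010100, 2009010100, 2009010101),
  count = c(12, 30, 45, 9),
  stringsAsFactors = FALSE
)

t.tokens <- tokens(sample_data)
str(t.tokens$count.h)
```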


The Plotting


Now that we have the tokens in a convenient list, we can visualize the token trends with a plot. Naturally, we use the plot() function; a sample method is shown below.


Plotting the Hashtag Count
The second section creates the x-axis labels consistent with the Date.Time values in the t.tokens$count.h variable. It plots the year and month for better readability.
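A hedged sketch of this kind of plot follows: draw the line without the default x axis, then add year and month labels by hand. The dates and counts are invented for illustration, and the names date.h and count.h stand in for the matching elements of the tokens list.

```r
# Invented monthly data in place of the real hashtag counts.
date.h <- seq(as.POSIXct("2008-11-01", tz = "UTC"), by = "month", length.out = 12)
count.h <- c(5, 8, 12, 15, 14, 20, 22, 25, 40, 30, 35, 50)

# Section 1: plot the counts as a line, suppressing the default x axis.
plot(date.h, count.h, type = "l", col = "blue", xaxt = "n",
     xlab = "Date", ylab = "Count", main = "Hashtag Count")

# Section 2: build year-month labels and draw them at the matching positions.
axis_labels <- format(date.h, "%Y-%m")
axis(1, at = as.numeric(date.h), labels = axis_labels, las = 2, cex.axis = 0.7)
```

Setting xaxt = "n" in plot() is what leaves room for the custom axis() call.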

Hashtag Count Plot

Next we add the other two tokens using the lines() function. To finalize the plot, we add a legend showing which line represents which token.
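One way this could look is sketched below, with invented counts for all three tokens. The variable names, colors, and legend position are illustrative choices, not taken from the original plot.

```r
# Invented monthly data for the three token types.
dates <- seq(as.POSIXct("2009-01-01", tz = "UTC"), by = "month", length.out = 6)
count.h <- c(10, 14, 12, 20, 18, 25)
count.s <- c(15, 18, 17, 22, 21, 30)
count.u <- c(20, 35, 30, 60, 55, 90)

# Set the y range from all three series so no line is clipped.
y_max <- max(count.h, count.s, count.u)
plot(dates, count.h, type = "l", col = "blue", ylim = c(0, y_max),
     xlab = "Date", ylab = "Count", main = "Token Counts Over Time")

# Add the other two series to the existing plot.
lines(dates, count.s, col = "darkgreen")
lines(dates, count.u, col = "red")

# The legend maps each color to its token.
legend("topleft", legend = c("hashtag", "smiley", "tweet_url"),
       col = c("blue", "darkgreen", "red"), lty = 1)
```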


Adding Smiley and URL Count Lines

Now we have the finished plot!

Plot of All Three Token Counts Over Time

Note that starting in January 2009, user activity, especially URL content, begins to spike above the previous background levels. Observe that the drastic spikes in URL counts in July and October 2009 coincide with spikes in hashtag and smiley counts. The increases in content can be attributed to the increase in Twitter users starting in 2009.

Stay tuned for more Twitter analysis!

Thanks for reading!


Wayne

2 comments:

  1. I am new to analytics using twitter, The link provided for downloading the twitter tokens dataset seems to be broken, can you please provide me a alternate link.
