
Wednesday, September 24, 2014

Natural Language Processing in Python: Part 4. Frequency Distributions, Word Selections, & Collocations


Hello Readers,

We settle into Part 4 of the Text Analysis Series with Python by examining frequency distributions, word selections, and collocations. As a refresher, a collocation is a sequence of words that occurs together unusually often, given the frequencies of the individual words.

A specific case is a pair of words forming a collocation as a bigram, such as 'red wine'; a trigram would be a sequence of three words. 'red' and 'wine' occur together quite often, as opposed to the generic 'the wine', while 'maroon wine' would make little sense. This demonstrates how collocations are resistant to substitution: only those particular words carry that meaning, so only those words are used, and consequently they occur together frequently.

Before we jump into collocations, let's start with frequency distributions. The 'nltk' module in Python provides a number of text corpora for analysis. Load the module, and get yourself comfortable for the ride.


Frequency Distributions


We did some counting in a previous NLP post. We will count in this post as well, but with a different process: we aim to quantify each unique token in a given text corpus. How many times does 'government' occur in the Inaugural Address corpus? What about 'lol' in the Internet Chat corpus? Use 'FreqDist()'! As we see below, 'FreqDist()' takes the text and creates a frequency distribution over the unique tokens, and those tokens are not all words. They can be periods, parentheses, commas, etc.


From the summary of 'fdist1', we observe that it has 19,317 samples (the unique tokens) and 260,819 outcomes (the total number of tokens, which is the length of the text).

Code:
Python 2.7.8 (default, Jun 30 2014, 16:03:49) [MSC v.1500 32 bit (Intel)]
Type "copyright", "credits" or "license" for more information.

IPython 2.1.0 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.
%guiref   -> A brief reference about the graphical user interface.

In [1]: from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908

#####
# frequency distributions

# create FreqDist object
In [3]: fdist1 = FreqDist(text1)

# summary of FreqDist object, 19317 unique tokens with 260,819 total tokens
In [4]: fdist1
Out[4]: <FreqDist with 19317 samples and 260819 outcomes>

# retrieve set of tokens
In [5]: vocab1 = fdist1.keys()

# display first 10 of set
In [6]: vocab1[:10]
Out[6]:
[u'funereal',
 u'unscientific',
 u'divinely',
 u'foul',
 u'four',
 u'gag',
 u'prefix',
 u'woods',
 u'clotted',
 u'Duck']

# display number of occurrences for 'whale' token
In [7]: fdist1['whale']
Out[7]: 906

# plot first 20 terms
In [8]: fdist1.plot(20, cumulative=False)

Accessing the '.keys()' method, we can assign the unique tokens to 'vocab1' and look at the first 10 elements. We see words such as 'funereal', 'unscientific', and 'divinely'. Lastly, we can look up the count of a word. Take 'whale': since text1 is Moby Dick, we see that text1 has 906 occurrences of 'whale'.


Lastly, the 'FreqDist' object has a plot function. We specify the number of terms, and whether the plot is cumulative or not, and Python returns Figure 1.


Figure 1. Frequency Distribution of 20 Terms in Text1

Notice how 'four' has the highest count in this sample of 20 terms, with 'hanging' coming in second. The words you see with barely any count have a count of 1: they occur in the text corpus only once. These special terms are called hapaxes (singular: hapax). nltk has a function just for identifying hapaxes. You guessed it, '.hapaxes()'!


Code:
######
# hapaxes are words that occur only once

# display first 10 occurrences
In [8]: fdist1.hapaxes()[:10]
Out[8]:
[u'funereal',
 u'unscientific',
 u'prefix',
 u'plaudits',
 u'woody',
 u'disobeying',
 u'Westers',
 u'DRYDEN',
 u'Untried',
 u'superficially']
 
# display total number of single occurrences 
In [9]: len(fdist1.hapaxes())
Out[9]: 9002

The first 10 hapaxes can be sliced from the function's result, and we can also see how many hapaxes are in text1 by passing the result to 'len()'. We see that text1 has 9,002 terms which occur only once.



Word Selections


Now we will take advantage of Python's ability to loop through iterable objects to select words with certain attributes, such as word length and number of occurrences in a corpus. This way we can search for defining words which capture the essence of a corpus, or track trends in word usage. We can adjust the word-count or word-length thresholds to suit different circumstances.


Word Length
We can create an if condition to accept only words with more than 15 characters in the set of unique tokens from text1, as shown below. Some long words which occur are: 'CIRCUMNAVIGATION', 'Physiognomically', and 'apprehensiveness'.


Code:
# using set theory to select words
# lengthy words of more than 15 characters
# {w | w ∈ V & P(w)}
# [w for w in V if P(w)]
# the set of all w such that w is an element of V (vocab) and has property P

# get set of vocab in text1
In [10]: V = set(text1)

# iterate through V, grabbing each word with character length greater than 15
In [11]: long_words = [w for w in V if len(w) > 15]

# display sorted first 10 lengthy words
In [12]: sorted(long_words)[:10]
Out[12]:
[u'CIRCUMNAVIGATION',
 u'Physiognomically',
 u'apprehensiveness',
 u'cannibalistically',
 u'characteristically',
 u'circumnavigating',
 u'circumnavigation',
 u'circumnavigations',
 u'comprehensiveness',
 u'hermaphroditical']

#####
# looking at internet long word patterns
# more than 15 characters

# check which text number is internet chat, #5 
In [13]: texts()
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908

# create unique vocab set
In [14]: vocab = set(text5)

# iterate through vocab for words greater than 15 characters in length
In [15]: long_chat_words = [word for word in vocab if len(word) > 15]

# display first 10 sorted
In [16]: sorted(long_chat_words)[:10]
Out[16]:
[u'!!!!!!!!!!!!!!!!',
 u'!!!!!!!!!!!!!!!!!!!!!!',
 u'!!!!!!!!!!!!!!!!!!!!!!!',
 u'!!!!!!!!!!!!!!!!!!!!!!!!!!!',
 u'!!!!!!!!!!!!!!!!!!!!!!!!!!!!',
 u'!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!',
 u'#talkcity_adults',
 u'(((((((((((((((((',
 u'((((((((((((((((((',
 u'((((((((((((((((((((']

# slice elements 101 to 111; empty, since there are fewer than 101 long words
In [17]: sorted(long_chat_words)[100:111]
Out[17]: []

# index from last for last 10
# observe exaggerated chat patterns
In [18]: sorted(long_chat_words)[-10:]
Out[18]:
[u'oooooooooooooonnnnnnnnnnnneeeeeeeeeeeeeeesssssssss',
 u'raaaaaaaaaaaaaaaaaaaaaaaaaaaaa',
 u'tatatatnanaantatat',
 u'weeeeeeeeeeeeeeee',
 u'weeeeeeeeeeeeeeeeeeeeeeeeed',
 u'wheeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee',
 u'woooooooooaaaahhhhhhhhhhhh',
 u'wooooooooooooohoooooooooooooooo',
 u'www.Wunderground.com',
 u'yuuuuuuuuuuuummmmmmmmmmmm']

Next, we use the Internet Chat corpus, text5, to examine some long words. The first few are simply runs of exclamation points, while the last few are 'overspelled' for dramatic effect.


Word Length and Frequency
Besides word length, we can consider another attribute when selecting words from a corpus: word frequency. Even if a word is long, its occurring more than a few times could indicate that it is an important word in the corpus. So we include the count from the frequency distribution.

Looking at the Internet Chat corpus (text5) again, we select for the word length to be more than 8, and a frequency higher than 5. Keep in mind we have to create the frequency distribution from text5 first, and use the set of text5 as the iterable variable.

Code:
#####
# which words typify a text? 
# the long words or the single occurrences? (hapaxes)
# what about frequently occurring long words?

# display text names
In [19]: texts()
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908

# create FreqDist object for text5
In [20]: fdist5 = FreqDist(text5)

# sort words iterated through set of text5
# having character length more than 8, and occurring more than 5 times
In [25]: selected_words = sorted([w for w in set(text5) if len(w) > 8 and fdist5[w] > 5])

# display words selected on minimum word length and occurrence
In [26]: selected_words
Out[26]:
[u'#14-19teens',
 u'#talkcity_adults',
 u'((((((((((',
 u')))))))))))',
 u')))))))))))))',
 u'.........',
 u'Compliments',
 u'cute.-ass',
 u'everybody',
 u'everything',
 u'listening',
 u'seriously',
 u'something',
 u'sometimes']

The results vary in content, since Internet Chat is not censored, as you might discover if you delve deeper into the text.



Bigrams & Collocations


Here we arrive at word pairs, and at special word pairs. The 'bigrams()' function creates pairings of words as it iterates through the text, combining adjacent words. Collocations are those word pairs which occur together unusually frequently; you might find that they carry a particular meaning when seen together and are not as descriptive apart.

The bigrams for 'more is said than done' are shown below; note that adjacent words are paired. For collocations, use the '.collocations()' method on a text corpus to retrieve the list of collocation terms. Looking at text4, the Inaugural Address corpus, 'United States', 'fellow citizens', 'four years', and similar pairs appear together far more often than their individual word frequencies would suggest. 'Indian tribes' and 'Chief Justice' are other terms which occur together.

Code:
# collocation : sequence of words that occur together unusually often
# ex: 'red wine', as opposed to 'the wine'
# bigram : word pairs 

# create bigram list
In [34]: bigram1 = list(bigrams(['more','is','said','than','done']))

# display bigram, note pairs of words
In [35]: bigram1
Out[35]: [('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]

# collocation for text4
# bigrams with words that occur together more frequently 
# than expected based on frequency of individual words
In [37]: text4.collocations()
Building collocations list
United States; fellow citizens; four years; years ago; Federal
Government; General Government; American people; Vice President; Old
World; Almighty God; Fellow citizens; Chief Magistrate; Chief Justice;
God bless; every citizen; Indian tribes; public debt; one another;
foreign nations; political parties

# collocations for personals corpus
In [38]: texts()
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908

# display personals collocations
In [39]: text8.collocations()
Building collocations list
would like; medium build; social drinker; quiet nights; non smoker;
long term; age open; Would like; easy going; financially secure; fun
times; similar interests; Age open; weekends away; poss rship; well
presented; never married; single mum; permanent relationship; slim
build

For the personals corpus, text8, we encounter word pairs from online personal advertisements. Likely and logical word combinations such as 'medium build', 'social drinker', 'quiet nights', 'easy going', 'financially secure', and 'permanent relationship' are paired together due to the nature of personality and lifestyle descriptions. Simply put, those words go together, like peanut butter and jelly, although I am a 'peanut butter will do' person. The pair 'peanut butter' would be a collocation in a baking recipe or snacks corpus.


Wow, there goes another post! I know there was only one graph this time, so thanks for making it this far. Here we learned about frequency distributions, different ways to select words from a text corpus, and word selection tools to create bigrams and collocations in Python. This is just the beginning, and there is lots more text analysis to cover! So stay tuned.



Thanks for reading,

Wayne
@beyondvalence
LinkedIn

Text Analysis Series:
1. Natural Language Processing in Python: Part 1. Texts
2. Natural Language Processing in Python: Part 2. Counting Vocabulary
3. Natural Language Processing in Python: Part 3. Indexing Lists
4. Natural Language Processing in Python: Part 4. Frequency Distributions, Word Selections, & Collocations

Thursday, March 6, 2014

Sochi #Olympics and #Crimea Tweets in R (and Justin Bieber?!)


Hello Readers,


With the 2014 Olympics in Sochi over, we can reflect upon the social media coverage in the Olympic Games. What were Twitter users tweeting about that involved #Olympics during the games? And read on to see how Justin Bieber pops up (I was confused/surprised too).

In other news across the pond, if you have not been stuck in a cave recently, you might have heard about Russian forces 'occupying' the southern Ukrainian peninsula of Crimea. Here is a series of informative satellite images of Crimea and military force disposition around Ukraine. We will see what Twitter users are talking about concerning #Crimea.

The twitteR package in R has many interesting features which we will use to query tweets from the newly modified Twitter API. So be sure to double check your unique consumer and secret keys, though they should not have changed. Also verify that the URLs used in the OAuth generation include https://, and not just http://.

Let us see what Twitter users have been tweeting about the Olympics and Crimea. Onward with text analysis!


Querying Tweets



In previous posts I covered retrieving tweets from the Twitter API and transforming them into documents to be analyzed. So here I will simply show the R code, beginning with establishing a handshake with the Twitter API (use https:// with the recent changes).



Setting Up OAuth
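
The OAuth setup itself appears here only as a screenshot, so below is a rough sketch of what the handshake looked like with twitteR and ROAuth at the time; the consumer key and secret are placeholders, and the exact calls may differ from the original post.

library(twitteR)
library(ROAuth)

# placeholder keys from your Twitter application
consumer.key <- "your_consumer_key"
consumer.secret <- "your_consumer_secret"

# note the https:// URLs required by the modified API
req.url <- "https://api.twitter.com/oauth/request_token"
access.url <- "https://api.twitter.com/oauth/access_token"
auth.url <- "https://api.twitter.com/oauth/authorize"

# create the credential, perform the handshake, and register it
cred <- OAuthFactory$new(consumerKey=consumer.key,
                         consumerSecret=consumer.secret,
                         requestURL=req.url,
                         accessURL=access.url,
                         authURL=auth.url)
cred$handshake()
registerTwitterOAuth(cred)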

During some of my queries I found that the number of tweets returned from the API varied from 25 to 199, even with n set to 300. A workaround was to query multiple times and join the resulting tweets together. I had no such problem with the Olympics tweets, however, as they were queried weeks ago during the games.


Give Me More Tweets!
For the Olympics tweets the code would simply be:

olympics <- searchTwitter("#Olympics", n=300)
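
For the #Crimea tweets, where single queries returned fewer tweets than requested, a workaround along these lines could be used (a sketch; the repeated query counts and the de-duplication step are my assumptions, not the original code):

# query several times and combine the lists of status objects
crimea1 <- searchTwitter("#Crimea", n=300)
crimea2 <- searchTwitter("#Crimea", n=300)

# convert to a data.frame and drop duplicate tweets by id
crimea.df <- twListToDF(c(crimea1, crimea2))
crimea.df <- crimea.df[!duplicated(crimea.df$id), ]
nrow(crimea.df)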


Text Transformations



Next we convert the lists to data.frames, and then to a text corpus. Once they are in a text corpus, we can transform the text so we can query the words effectively. The crimea.c transformations are similar to those for olympics.c, so they are omitted.


Raw Lists to Transformed Text Corpus
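
Since the transformation code survives here only as an image, below is a minimal sketch of the list-to-corpus conversion and the usual tm clean-up steps, assuming the olympics list from the query above; the removeURL helper and the exact set of transformations are assumptions.

library(tm)

# list of tweets -> data.frame -> text corpus
olympics.df <- twListToDF(olympics)
olympics.c <- Corpus(VectorSource(olympics.df$text))

# lower case, strip URLs, punctuation, numbers, and stop words
removeURL <- function(x) gsub("http[[:alnum:][:punct:]]*", "", x)
olympics.c <- tm_map(olympics.c, content_transformer(tolower))
olympics.c <- tm_map(olympics.c, content_transformer(removeURL))
olympics.c <- tm_map(olympics.c, removePunctuation)
olympics.c <- tm_map(olympics.c, removeNumbers)
olympics.c <- tm_map(olympics.c, removeWords, stopwords("english"))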
Now we are ready to normalize the text so that we can count term frequencies and turn the corpora into term document matrices.


Word Stemming and Term Doc Matrix
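
A sketch of the stemming and term document matrix steps follows; olympics.tdm and the analogous crimea.tdm are my names for the objects, and the original code is in the image above.

library(SnowballC)

# stem the cleaned corpus and build the term document matrix
olympics.c <- tm_map(olympics.c, stemDocument)
olympics.tdm <- TermDocumentMatrix(olympics.c)
olympics.tdm

# crimea.tdm is built the same way from the crimea corpus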

From the term document matrix, we can calculate the frequency of terms across all tweets. For example, from the crimea.tdm tweets, we can print the words occurring more than 20 times:


High Frequency Words in Crimea Tweets
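
One way to list those frequent terms, assuming the crimea.tdm object sketched above, is tm's findFreqTerms():

# terms occurring at least 20 times in the #Crimea tweets
findFreqTerms(crimea.tdm, lowfreq=20)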

The next step would be to visualize these frequencies in a word cloud. #Olympics is up first.


#Olympics



In a previous post, we created a word cloud visualizing words from @nbastats. Here we shall do the same for the trend #Olympics.



Creating a Word Cloud
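
A minimal sketch of the word cloud step, assuming olympics.tdm from above; the minimum frequency cutoff here is a placeholder, not the value used for the original figure.

library(wordcloud)

# term counts from the term document matrix, sorted
m <- as.matrix(olympics.tdm)
word.freq <- sort(rowSums(m), decreasing=TRUE)

# draw the cloud, most frequent words placed first
set.seed(1234)
wordcloud(words=names(word.freq), freq=word.freq, min.freq=5,
          random.order=FALSE)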

And here it is!


#Olympics Word Cloud

We see olympics, gold, sochi, and other strange terms such as jeremybieber (what!?). Who is Jeremy Bieber? An athlete? I had no idea, so I searched Google, and he is definitely not an athlete. Apparently he is the father of 'notorious' pop star Justin Bieber. Some celebrity drama was unfolding during the Olympic Games, and people took to Twitter to comment. But it was weird (for me anyway) that music celebrities would be included in the same tweets tagged with #Olympics.


Upon more digging with Google, I found possible reasons: ice hockey and Canada. The Biebers are originally from Canada, and many Justin Bieber 'haters' want him to go back to Canada. With the American and Canadian hockey teams facing off in the Olympic semi-finals, this billboard popped up in Chicago:


Loser Keeps Bieber
So there was the cause for all the hullabaloo on Twitter. Unfortunately for the 'haters', Justin stays, because the USA men's hockey team lost to Canada 1-0 in the semi-finals. I could not make this up.



#Crimea



For Crimea (Bieber is not Ukrainian too, is he?), the word cloud code is the same, except that the *.tdm changes to crimea.tdm. And here it is:


#Crimea Word Cloud

We see many terms which associate with Crimea, such as ukraine, russia, putin, referendum, and kiev, as well as some Ukrainian spellings, such as kyiv (Kiev, the Ukrainian capital). The majority of people in Crimea are Russian, while many in western Ukraine are Ukrainian and want to join the EU. Crimea previously held a referendum on joining Russia in 2004, and moved toward another one today, March 6th, 2014: the local lawmakers in Crimea voted unanimously to join Russia, and will hold a regional vote in 10 days. For a video on the history of Crimea/Ukraine/Russia, click here.

I thought the terms in #Crimea were more logical and politically relevant than terms in #Olympics, although it was amusing to see Justin and his dad mentioned.

___________________________

Hopefully this post shows you how Twitter keywords or trends can be analyzed and visualized, especially when current events are concerned. Tweets are near real-time text data of what people are thinking about, and it is easy to analyze them using R. Stay tuned for more R posts!



As always, thanks for reading,


Wayne
@beyondvalence

Monday, January 6, 2014

Text Mining: 4. Performing Term Associations and Creating Word Clouds in R


Hello Readers,




Welcome back to my blog. Today we will discuss analysis of a term document matrix that we created in the last post of the Text Mining Series. 

We will perform frequent term searches, and terms associations with visualizations. Then we finish the post by creating a visual word cloud (to the right) to 'display' the content of the terms in the tweets from @nbastats. Read Part 5 here.


Start R and let us begin programming!




Plotting Word Frequencies



Here we continue from where we left off last time. Begin by loading the twitteR, tm, and ggplot2 packages in R. Use the findFreqTerms() function in the tm library to find the most frequent terms. We can specify the lower and upper bounds of the frequency values using lowfreq and highfreq. Here we return terms with 20 or more occurrences.


High Frequency Terms
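
The original call is shown as an image; a sketch of it follows, assuming the term document matrix from the previous post is named tweets.tdm.

library(tm)

# terms with 20 or more occurrences in the @nbastats tweets
findFreqTerms(tweets.tdm, lowfreq=20)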

Next we take the 17 terms and create a bar graph of the frequencies using ggplot2. We obtain the term counts using rowSums() and subset the sums to return values of 20 or greater. Then we can plot the graph using qplot(); the geom="bar" argument creates the bar graph, and coord_flip() flips the x and y axes.


Bar Graph Code
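
A sketch of the bar graph follows; it uses ggplot() with geom_bar(stat="identity") rather than the qplot() call in the screenshot, and assumes tweets.tdm as before.

library(ggplot2)

# term counts, keeping those with 20 or more occurrences
term.freq <- rowSums(as.matrix(tweets.tdm))
term.freq <- subset(term.freq, term.freq >= 20)
freq.df <- data.frame(term=names(term.freq), freq=term.freq)

# horizontal bar graph of the frequent terms
ggplot(freq.df, aes(x=term, y=freq)) +
  geom_bar(stat="identity") +
  coord_flip() +
  xlab("Terms") + ylab("Count")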

The neat result is shown below:


Term Frequencies

We can see that "games", "last", and "amp" are the top three terms by frequency.



Finding Word Associations



Using word associations, we can find the correlation between certain terms and how they appear in tweets (documents). We can perform word associations with the findAssocs() function. Let us find the word associations for "ppg" (points per game) and return the word terms with correlations higher than 0.25.


Associated Terms for "ppg"
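
The association search itself is a one-liner with tm's findAssocs(), assuming tweets.tdm:

# terms correlated with "ppg" at 0.25 or higher
findAssocs(tweets.tdm, "ppg", 0.25)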

We see that "ppg" has a high 0.65 correlation with "rpg", or rebounds per game. This makes sense, as a tweet which contains statistics about points per game would also include other statistics such as rebounds, as well as "apg" (assists per game) and "fg" (field goals).


What about a team, say the Heat which has LeBron James? We can find the word associations for "heat":


Word Associations for "heat"

The top correlated terms are "adjusts", "fts" (free throws), "rockets", "thunder", and "value". Both the Houston Rockets and the Oklahoma City Thunder are top teams, so it makes sense that top teams would be mentioned in the same tweet, especially if they play against each other. LeBron is having a record year in player efficiency, which might be why "trueshooting" (true shooting percentage) is an associated term with a 0.49 correlation.


We can plot the word associations for "heat". The code, similar to the previous plot, is shown below.


Plotting "heat" Word Associations
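
A sketch of one way to plot the associations follows; note that recent versions of tm return findAssocs() results as a named list, which is what the indexing below assumes.

# associations for "heat" reshaped into a data.frame
heat.assoc <- findAssocs(tweets.tdm, "heat", 0.25)[["heat"]]
heat.df <- data.frame(term=names(heat.assoc), corr=heat.assoc)

# horizontal bar graph of the correlations
ggplot(heat.df, aes(x=term, y=corr)) +
  geom_bar(stat="identity") +
  coord_flip() +
  xlab("Associated Terms") + ylab("Correlation with \"heat\"")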

Which yields:


"heat" Word Associations

And there we have the word associations for the term "heat". I think that is a nice looking visual.



Creating a Word Cloud



We are going to continue the visual creation spree, and this time we will create a word cloud. Load the wordcloud package in R and convert the tweet term document matrix into a regular matrix.


wordcloud, Matrix Conversion, and Sorted Word Frequencies
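
A sketch of that conversion and sort, again assuming tweets.tdm:

library(wordcloud)

# term document matrix -> plain matrix -> sorted term counts
m <- as.matrix(tweets.tdm)
word.freq <- sort(rowSums(m), decreasing=TRUE)
head(word.freq)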

After we create a word count of all the terms and sort them in descending order, we can proceed to making the word cloud. We set a seed (1234) so that the work is reproducible. We also need a gradient of colors (this time in gray), ranging from 0 to 1, based on word frequency: more frequent terms will appear in a darker font in the word cloud.

With the wordcloud() function, we can create the word cloud. We need to specify the words, their frequencies, a minimum frequency of a term for inclusion, and the color of the words in the cloud.


Word Cloud Plot Code
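
A sketch of the plotting call; the gray palette and the minimum frequency cutoff are stand-ins for the values in the screenshot.

# gray shades from light to dark; wordcloud assigns darker shades
# to the more frequent terms
gray.levels <- gray(seq(0.8, 0, length.out=8))

set.seed(1234)
wordcloud(words=names(word.freq), freq=word.freq, min.freq=10,
          random.order=FALSE, colors=gray.levels)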

The word cloud is shown below!


Word Cloud for @nbastats

So in this post we have performed word associations for high frequency terms in our tweet term document matrix. We have also created a visualization of the word associations and a word cloud to display the top occurring terms in the term document matrix for tweets from @nbastats.


Stay tuned for more analysis for hierarchical clustering of the terms!


Thanks for reading,

Wayne

@beyondvalence
LinkedIn

Sunday, January 5, 2014

Text Mining: 3. Stemming Text and Building a Term Document Matrix in R


Hello Readers,



In our last post in the Text Mining Series, we talked about converting a Twitter tweet list object into a text corpus, a collection of text documents, and we transformed the tweet text to prepare it for analysis.

However, the tweets corpus is not quite done. We are going to stem the tweets and build a document matrix which will enable us to perform frequent term searches and word associations. So now we will continue where we left off from Text Mining 2. Read Text Mining 4 here.

Start R and let us begin!



Stemming Text



Here we have one more step before we start with the basics of counting word frequencies. Since we are looking at the @nbastats handle, consider the word score. It can exist in many other forms, such as scoring, scored, or scores and so on. When we want to count the number of score terms, we would like all the variations to be counted.

So stemming score and other words will allow us to count different variations as the same term. How is this possible? Stemming will truncate words to their radical form so that score, scoring, scored, or scores will turn into 'scor'. Think of stemming as linguistic normalization. That way different spelling occurrences will appear as the same term, and will be counted towards the term frequency.

In R, load the SnowballC package, along with twitteR, and tm. The SnowballC package will allow us to stem our text of tweets. The tweets corpus data was created in the previous post. 

Load Packages and Tweets Corpus

After creating a copy of the tweets corpus for a reference prior to the stemming changes, we can proceed with the stemming. With the tm_map() function, use the stemDocument transformation. Inspect the documents in the text corpus with the inspect() function.


Stemming and Inspecting the Document
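
A sketch of those steps, using the tweets.corpus object from the previous post:

library(twitteR)
library(tm)
library(SnowballC)

# keep a copy of the corpus as the dictionary for stem completion later
tweets.corpus.copy <- tweets.corpus

# stem the documents and inspect a few of them
tweets.corpus <- tm_map(tweets.corpus, stemDocument)
inspect(tweets.corpus[1:5])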

Check the pre-stemmed tweet:


Copy of Tweet 4

See how 'leaders' and 'entering' in the copy corpus were stemmed to 'leader' and 'enter' in the transformed corpus. Like the string-wrapping function we created for printing tweets from the tweets list in Retrieving Twitter Tweets, here is one for printing tweets from a stemmed text corpus:


Print Tweets from Corpus Function
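
The function itself is only in the screenshot, so the version below is a hypothetical reconstruction; the name printCorpusTweets and the 70-character wrap width are my choices.

# print documents start..end from a corpus, wrapping long lines
printCorpusTweets <- function(corpus, start, end) {
  for (i in start:end) {
    cat(paste0("[[", i, "]]\n"))
    writeLines(strwrap(as.character(corpus[[i]]), width=70))
  }
}

# example: documents 4 through 9
printCorpusTweets(tweets.corpus, 4, 9)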

The output for tweets 4 through 9 are shown below.


Tweet Corpus Documents 4 to 9

Next, instead of stemming, we will implement stem completion.



Stem Completion



We complete the stems to reverse the stemming process, so that the text looks more 'normal' and readable. By passing the stemCompletion() function to tm_map(), we can complete the stems of the documents in tweets.corpus, with tweets.corpus.copy as the dictionary. The default option completes each stem to the matching term with the highest frequency.


Stem Completion Function
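
With the tm version current at the time, the call was roughly the following; newer tm releases require wrapping stemCompletion() in a content_transformer() that splits each document into words first.

# complete stems against the pre-stemming copy of the corpus;
# the default completes each stem to its most frequent match
tweets.corpus <- tm_map(tweets.corpus, stemCompletion,
                        dictionary=tweets.corpus.copy)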

Then we can call the function to print out the tweets for 4 through 9 again, and the output for before and after stem completion are stacked below:


Text Corpus Stem Completion Comparison

There are several things to note. Starting at the beginning of tweet 4, we can see that 'leader' was replaced with 'leaders', 'enter' with 'entering', 'deandr' with 'deandre' (DeAndre Jordan the center for the Los Angeles Clippers), 'cousin' with 'cousins' (DeMarcus Cousins, the center for the Sacramento Kings).

However, there are some major differences, such as 'carmeloanthoni' being completely omitted after the stem completion (sorry Melo)- although 'kingjam' was replaced with 'kingjames' (congrats LeBron James). 'statist' was completed to 'statistical'. 'millsap' was incorrectly completed to 'millsaps' (correct as Millsap- sorry Paul). Some shorthand was completed as well- 'reb' was completed to 'rebs' for rebounds. Looking at 'effici' it was completed to 'efficient'.

From the last tweet (9), we see that 'd' was completed to 'defrtg' (short for defensive rating), although it was just a regular 'd' to begin with. The 'okcthund' was completed to the more familiar 'okcthunder' (Oklahoma City Thunder).

So not all completions did a phenomenal job. That is to be expected when we use the default of completing the stemmed matches with the highest frequency in the dictionary, the copy of the corpus.


Building a Term-Document Matrix



Now that we have stemmed and completed our text corpus, we can proceed to build a term-document matrix. Load the tm and twitteR packages, and continue using the stem-completed corpus.


Loading Packages and Text Corpus

In a term document matrix, the terms are the rows and the documents are the columns. We could also create a document term matrix, which would simply be the transpose of the term document matrix: the documents are represented by the rows and the terms by the columns. With the TermDocumentMatrix() function, we can create the term document matrix from our tweet text corpus. See in the output below that there are 606 terms and 186 documents (tweets).


Using the TermDocumentMatrix() Function
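
A sketch of that call, with tweets.tdm as my name for the resulting matrix:

# build the term document matrix: terms as rows, tweets as columns
tweets.tdm <- TermDocumentMatrix(tweets.corpus)
tweets.tdm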

There is also 98% sparsity, which means that 98% of the entries are 0: the row term does not appear in that particular document column. We can display the terms by calling the term dimension names. The list is quite extensive, so only a fraction of the 606 terms is shown below.



Terms

Observe some recognizable (basketball) terms such as "careerhigh", "blakegriffin", or "chicagobulls". But which term had the most occurrences across all the tweet documents? We can figure that out using the apply() function, summing across the rows of the term document matrix.

Most Occurring Term: games, 79 times
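
A sketch of the row-sum approach, assuming tweets.tdm from above:

# sum each term's occurrences across all tweet documents
term.counts <- apply(as.matrix(tweets.tdm), 1, sum)

# the most frequent term and its count
term.counts[which.max(term.counts)]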

We can also tweak the function to show the terms with more than 20 occurrences by passing it through the which() function.


Terms With 20 + Occurrences
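
Using the term.counts vector from the previous sketch:

# terms occurring more than 20 times
term.counts[which(term.counts > 20)]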

See that some of the top terms are "asts" assists, "fg" field goals, "games", "okcthunder" Oklahoma City Thunder, and "ppg" points per game among the various terms.

Let us visualize the term document matrix for specific terms and documents. We can find the index of the most frequent term, "games", and inspect it along with the following few terms for specific tweet documents. Again, we will use the inspect() function to view snippets of the term document matrix.


Term Frequencies in Tweet Documents 60 to 70
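
A sketch of that slice, locating "games" and inspecting it along with the next few terms for documents 60 through 70:

# index of "games" among the matrix terms
idx <- which(dimnames(tweets.tdm)$Terms == "games")

# inspect "games" and the following five terms for tweets 60 to 70
inspect(tweets.tdm[idx + (0:5), 60:70])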

Note that the sparsity is 87%, so like the rest of the matrix, this sub-matrix consists mostly of zero entries. Looking at the frequent "games" term, we see it appearing in the documents quite often, even twice in tweet document 64. The abbreviation "gms" is shown as well.




At the beginning we started with a corpus of tweet text, and we stemmed it, completed the stems, and converted it into a term document matrix. From the TD matrix we were able to count the frequencies of the terms in each document, bringing the data a final step closer to analysis-ready. 

In the next post we will begin to analyze the text data, and we shall demonstrate term frequencies, term associations, and word clouds. Stay tuned!



Thanks for reading,


Wayne
@beyondvalence
LinkedIn