
Wednesday, September 24, 2014

Natural Language Processing in Python: Part 4. Frequency Distributions, Word Selections, & Collocations


Hello Readers,

We settle into Part 4 of the Text Analysis Series with Python by examining frequency distributions, word selections, and collocations. As a refresher, a collocation is a sequence of words that occur together unusually often, given the frequencies of the individual words.

A specific case would be a pair of words forming a collocation as a bigram, such as 'red wine'; a trigram collocation would be three words long. 'red' and 'wine' occur together quite often, whereas the generic 'the wine' tells us little and 'maroon wine' would make little sense. That demonstrates how collocations resist substitution: only those particular words carry that meaning, so only those words are used, and they end up occurring together quite frequently.

Before we jump into collocations, let's start with frequency distributions. In the 'nltk' module in Python, we have a number of text corpora available for analysis. Load it, and get yourself comfortable for the ride.


Frequency Distributions


We did some counting in a previous NLP post, and we will count in this post as well, but with a different approach. We aim to quantify each unique token in a given text corpus. How many times does 'government' occur in the Inaugural Address corpus? What about 'lol' in the Internet Chat corpus? Use 'FreqDist()'! As we see below, 'FreqDist()' takes the text and creates a frequency distribution for the unique tokens, and they aren't all words. They can be periods, parentheses, commas, etc.


With the summary of 'fdist1', we observe that it has 19,317 samples (unique tokens) and 260,819 outcomes (the total number of tokens, which is the length of the text).

Code:
Python 2.7.8 (default, Jun 30 2014, 16:03:49) [MSC v.1500 32 bit (Intel)]
Type "copyright", "credits" or "license" for more information.

IPython 2.1.0 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.
%guiref   -> A brief reference about the graphical user interface.

In [1]: from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908

#####
# frequency distributions

# create FreqDist object
In [3]: fdist1 = FreqDist(text1)

# summary of FreqDist object, 19317 unique tokens with 260,819 total tokens
In [4]: fdist1
Out[4]: <FreqDist with 19317 samples and 260819 outcomes>

# retrieve set of tokens
In [5]: vocab1 = fdist1.keys()

# display first 10 of set
In [6]: vocab1[:10]
Out[6]:
[u'funereal',
 u'unscientific',
 u'divinely',
 u'foul',
 u'four',
 u'gag',
 u'prefix',
 u'woods',
 u'clotted',
 u'Duck']

# display number of occurrences for 'whale' token
In [7]: fdist1['whale']
Out[7]: 906

# plot first 20 terms
In [8]: fdist1.plot(20, cumulative=False)

Using the '.keys()' method, we can assign the unique tokens to 'vocab1' and look at the first 10 elements. We see words such as 'funereal', 'unscientific', and 'divinely'. We can also look up the count of a word: take 'whale', since text1 is Moby Dick, and we see that text1 has 906 occurrences of 'whale'.
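
As an aside, if you are running a newer nltk (version 3 and up), 'FreqDist' inherits from Python's 'Counter', so the most frequent tokens are available directly through '.most_common()'. A quick sketch, assuming the same 'fdist1' object from above:

# top 10 most frequent tokens and their counts (nltk 3.x)
# returns (token, count) pairs; for Moby Dick these are mostly
# punctuation and stopwords such as ',' and 'the'
fdist1.most_common(10)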


Lastly, the 'FreqDist' object has a plot function. We specify the number of terms, and whether the plot is cumulative or not, and Python returns Figure 1.


Figure 1. Frequency Distribution of 20 Terms in Text1

Notice how 'four' has the highest count in this sample of 20 terms, with 'hanging' coming in second. The words you see with barely any count have a count of 1: they occur in the text corpus only once! These special terms are called hapaxes (singular: hapax), and nltk has a function just for identifying them. You guessed it, '.hapaxes()'!


Code:
######
# hapaxes are words that occur only once

# display first 10 occurrences
In [8]: fdist1.hapaxes()[:10]
Out[8]:
[u'funereal',
 u'unscientific',
 u'prefix',
 u'plaudits',
 u'woody',
 u'disobeying',
 u'Westers',
 u'DRYDEN',
 u'Untried',
 u'superficially']
 
# display total number of single occurrences 
In [9]: len(fdist1.hapaxes())
Out[9]: 9002

The first 10 hapaxes can be sliced from the function's result, and we can count how many hapaxes are in text1 by passing the result to 'len()'. We see that text1 has 9,002 terms which occur only once.



Word Selections


Now we will take advantage of Python's ability to loop through iterable objects to select words with certain attributes, such as word length and number of occurrences in a corpus. This way we can search for defining words which capture the essence of a corpus, or track trends in word usage. We can adjust the word count or word length thresholds to our needs in different circumstances.


Word Length
We can create an if condition to accept only words with more than 15 characters in the set of unique tokens from text1, as shown below. Some long words which occur are: 'CIRCUMNAVIGATION', 'Physiognomically', and 'apprehensiveness'.


Code:
# using set theory to select words
# lengthy words of 15 characters or more
# {w | w ∈ V & P(w)}
# [w for w in V if P(w)]
# the set of all w such that w is an element of V (vocab) and has property P

# get set of vocab in text1
In [10]: V = set(text1)

# iterate through V, grabbing each word with character length greater than 15
In [11]: long_words = [w for w in V if len(w) > 15]

# display sorted first 10 lengthy words
In [12]: sorted(long_words)[:10]
Out[12]:
[u'CIRCUMNAVIGATION',
 u'Physiognomically',
 u'apprehensiveness',
 u'cannibalistically',
 u'characteristically',
 u'circumnavigating',
 u'circumnavigation',
 u'circumnavigations',
 u'comprehensiveness',
 u'hermaphroditical']

#####
# looking at internet long word patterns
# more than 15 characters

# check which text number is internet chat, #5 
In [13]: texts()
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908

# create unique vocab set
In [14]: vocab = set(text5)

# iterate through vocab for words greater than 15 characters in length
In [15]: long_chat_words = [word for word in vocab if len(word) > 15]

# display first 10 sorted
In [16]: sorted(long_chat_words)[:10]
Out[16]:
[u'!!!!!!!!!!!!!!!!',
 u'!!!!!!!!!!!!!!!!!!!!!!',
 u'!!!!!!!!!!!!!!!!!!!!!!!',
 u'!!!!!!!!!!!!!!!!!!!!!!!!!!!',
 u'!!!!!!!!!!!!!!!!!!!!!!!!!!!!',
 u'!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!',
 u'#talkcity_adults',
 u'(((((((((((((((((',
 u'((((((((((((((((((',
 u'((((((((((((((((((((']

# display 101st to 110th sorted, no results 
In [17]: sorted(long_chat_words)[100:111]
Out[17]: []

# index from last for last 10
# observe exaggerated chat patterns
In [18]: sorted(long_chat_words)[-10:]
Out[18]:
[u'oooooooooooooonnnnnnnnnnnneeeeeeeeeeeeeeesssssssss',
 u'raaaaaaaaaaaaaaaaaaaaaaaaaaaaa',
 u'tatatatnanaantatat',
 u'weeeeeeeeeeeeeeee',
 u'weeeeeeeeeeeeeeeeeeeeeeeeed',
 u'wheeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee',
 u'woooooooooaaaahhhhhhhhhhhh',
 u'wooooooooooooohoooooooooooooooo',
 u'www.Wunderground.com',
 u'yuuuuuuuuuuuummmmmmmmmmmm']

Next, we use the Internet Chat corpus, text5, to examine some long words there as well. The first few are simply runs of exclamation points, while the last few are 'overspelled' for dramatic effect.


Word Length and Frequency
Along with word length, we can consider another attribute for selecting words from a corpus: word frequency. Even a long-winded word could be an important one in the corpus if it occurs more than a few times, so we include the count from the frequency distribution.

Looking at the Internet Chat corpus (text5) again, we select words longer than 8 characters with a frequency higher than 5. Keep in mind we have to create the frequency distribution from text5 first, and iterate over the set of text5.

Code:
#####
# which words typify a text? 
# the long words or the single occurrences? (hapaxes)
# what about frequently occurring long words?

# display text names
In [19]: texts()
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908

# create FreqDist object for text5
In [20]: fdist5 = FreqDist(text5)

# sort words iterated through set of text5
# having character length more than 8, and occurring more than 5 times
In [25]: selected_words = sorted([w for w in set(text5) if len(w) > 8 and fdist5[w] > 5])

# display words selected on minimum word length and occurrence
In [26]: selected_words
Out[26]:
[u'#14-19teens',
 u'#talkcity_adults',
 u'((((((((((',
 u')))))))))))',
 u')))))))))))))',
 u'.........',
 u'Compliments',
 u'cute.-ass',
 u'everybody',
 u'everything',
 u'listening',
 u'seriously',
 u'something',
 u'sometimes']

The results vary in content, as Internet Chat is not censored, as you might discover if you delve deeper into the text.



Bigrams & Collocations


Here we arrive at word pairs and special word pairs. The 'bigrams()' function creates pairings of adjacent words as it iterates through the text. Collocations are those word pairs which occur together unusually frequently; you might find that they carry a particular meaning when seen together and are not as descriptive when apart.

The bigrams for 'more is said than done' are shown below; note that adjacent words are paired. For collocations, use the '.collocations()' method on a text corpus to retrieve the list of collocation terms. Looking at text4, the Inaugural Address corpus, pairs such as 'United States', 'fellow citizens', and 'four years' appear in those combinations more often than chance would suggest. 'Indian tribes' and 'Chief Justice' are other terms which occur together.

Code:
# collocation : sequence of words that occur together unusually often
# ex: 'red wine', as opposed to 'the wine'
# bigram : word pairs 

# create bigram list
In [34]: bigram1 = list(bigrams(['more','is','said','than','done']))

# display bigram, note pairs of words
In [35]: bigram1
Out[35]: [('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]

# collocation for text4
# bigrams with words that occur together more frequently 
# than expected based on frequency of individual words
In [37]: text4.collocations()
Building collocations list
United States; fellow citizens; four years; years ago; Federal
Government; General Government; American people; Vice President; Old
World; Almighty God; Fellow citizens; Chief Magistrate; Chief Justice;
God bless; every citizen; Indian tribes; public debt; one another;
foreign nations; political parties

# collocations for personals corpus
In [38]: texts()
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908

# display personals collocations
In [39]: text8.collocations()
Building collocations list
would like; medium build; social drinker; quiet nights; non smoker;
long term; age open; Would like; easy going; financially secure; fun
times; similar interests; Age open; weekends away; poss rship; well
presented; never married; single mum; permanent relationship; slim
build

For the Personals Corpus, text8, we encounter word pairs from online personal advertisements. Likely and logical word combinations such as 'medium build', 'social drinker', 'quiet nights', 'easy going', 'financially secure', and 'permanent relationship' are paired together due to the nature of personality and lifestyle descriptions. Simply put, those words go together, like peanut butter and jelly, although I am a 'peanut butter will do' person. The pair 'peanut butter' would be a collocation in a baking recipe or snacks corpus.
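
If you want more control than the '.collocations()' convenience method gives you, nltk also exposes the underlying machinery in 'nltk.collocations'. Below is a minimal sketch, not from the original session, that scores bigrams from the Personals Corpus tokens by pointwise mutual information; the frequency filter and the choice of PMI are assumptions you can tune:

from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# build a finder over the raw token list of the Personals Corpus
finder = BigramCollocationFinder.from_words(text8.tokens)

# ignore bigrams that appear fewer than 3 times
finder.apply_freq_filter(3)

# rank the remaining bigrams by pointwise mutual information (PMI)
bigram_measures = BigramAssocMeasures()
print(finder.nbest(bigram_measures.pmi, 10))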


Wow, there goes another post! I know there was only one graph this time, so thanks for making it this far. Here we learned about frequency distributions, different ways to select words from a text corpus, and how to create bigrams and collocations in Python. This is just the beginning, and there is lots more text analysis to cover! So stay tuned.



Thanks for reading,

Wayne
@beyondvalence
LinkedIn

Text Analysis Series:
1. Natural Language Processing in Python: Part 1. Texts
2. Natural Language Processing in Python: Part 2. Counting Vocabulary
3. Natural Language Processing in Python: Part 3. Indexing Lists
4. Natural Language Processing in Python: Part 4. Frequency Distributions, Word Selections, & Collocations

Friday, September 5, 2014

Natural Language Processing in Python: Part 3. Indexing Lists


Hello Readers,

Here we continue the Text Analysis in Python blog series by examining list manipulation. Since we work with large amounts of text, knowing how to manipulate it efficiently in Python matters, and working with lists is crucial to natural language processing. As usual, we will use the nltk module and IPython. Keep an eye out for Monty Python.

Start IPython from the command line and let us begin.


Sentences as Lists

Previously when we worked with nltk we used the texts and the terms inside them. nltk also provides the first sentence from each of the 9 texts as variables we can use (sent1, sent2, ..., sent9). The sentences are Python lists, which are enclosed in brackets [ ]. Each element in a list can be a string, a number, or even another list.
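
As a quick illustration (plain Python, not from the corpus), a single list can mix element types:

# a list can hold strings, numbers, and even another list
mixed = ['Ishmael', 1851, ['Moby', 'Dick']]
len(mixed)      # 3 elements
mixed[2]        # the nested list ['Moby', 'Dick']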

So by typing "sent" followed by the sentence number, we can print the tokens in that list.

Code:
C:\Users\wayne>ipython --matplotlib
Python 2.7.8 (default, Jun 30 2014, 16:03:49) [MSC v.1500 32 bit (Intel)]
Type "copyright", "credits" or "license" for more information.

IPython 2.1.0 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.
Using matplotlib backend: Qt4Agg

In [1]: from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908

In [29]: sents()
sent1: Call me Ishmael .
sent2: The family of Dashwood had long been settled in Sussex .
sent3: In the beginning God created the heaven and the earth .
sent4: Fellow - Citizens of the Senate and of the House of Representatives :
sent5: I have a problem with people PMing me to lol JOIN
sent6: SCENE 1 : [ wind ] [ clop clop clop ] KING ARTHUR : Whoa there !
sent7: Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 .
sent8: 25 SEXY MALE , seeks attrac older single lady , for discreet encounters .
sent9: THE suburb of Saffron Park lay on the sunset side of London , as red and ragged as a clo
ud of sunset .

In [2]: sent1
Out[2]: ['Call', 'me', 'Ishmael', '.']


Modifying Lists

We can modify a list by appending elements, which adds them to the end of the list. Using ".append()" we can change the list to our liking. We can also use an operation called concatenation, where we join two lists together with a "+" sign; it is like addition for lists and strings.

Code:

# appending

In [3]: sent1.append('hello there')

In [4]: sent1
Out[4]: ['Call', 'me', 'Ishmael', '.', 'hello there']

In [5]: sent1[::-1]
Out[5]: ['hello there', '.', 'Ishmael', 'me', 'Call']

# concatenation 

In [19]: new = sent2+sent3

In [20]: new
Out[20]:
['The',
 'family',
 'of',
 'Dashwood',
 'had',
 'long',
 'been',
 'settled',
 'in',
 'Sussex',
 '.',
 'In',
 'the',
 'beginning',
 'God',
 'created',
 'the',
 'heaven',
 'and',
 'the',
 'earth',
 '.']

As we can see, appending 'hello there' simply adds it to the end of "sent1". Calling "sent1" again, we see the extra element at the end, and the change is permanent. (The slice "[::-1]" in between simply returns the list in reverse order; it does not change "sent1".) With concatenation, we join "sent2" and "sent3" together with "+", and we end up with the two first sentences mashed into one list.
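
One design note worth keeping in mind: '.append()' modifies the list in place and adds its argument as a single element (even if that argument is itself a list), while '+' builds a brand new list and leaves the originals untouched. To add the elements of one list onto another in place, '.extend()' is the usual choice. A small sketch with made-up lists:

a = ['Call', 'me']
b = ['Ishmael', '.']

a.append(b)              # a is now ['Call', 'me', ['Ishmael', '.']] - b nested as one element
a = ['Call', 'me']
a.extend(b)              # a is now ['Call', 'me', 'Ishmael', '.'] - elements added in place
c = ['Call', 'me'] + b   # c is a new list; b and the original list are unchanged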



List Indexes

Why do Python indexes start at zero?

Remember that in Python, list indexes start at zero, and the last element is at index n-1. It is not because computer scientists want to confuse people; it comes from how Python reads lists. Think of it this way: when Python first accesses the list, it is already pointing at the first element. To reach the second element it has to move once, and to reach the third element it has to move twice, so the third element has index 2. The index represents the number of steps Python must move from the start of the list; for example, to reach the fifth element, Python must move 4 times.

We can use this concept to find the indexes of certain tokens, or to retrieve the elements at certain indexes in a list. At index 15 of "text1", the 16th token (token, not word, because tokens include punctuation), we find 'Consumptive', likely someone suffering from tuberculosis. Note that ".index('Consumptive')" finds the first occurrence of the word and returns its position.

Code:
In [11]: text1[15]
Out[11]: u'Consumptive'

In [12]: text1.index('Consumptive')
Out[12]: 15

In [13]: text1.index('consumptive')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-13-e45cff026a60> in <module>()
----> 1 text1.index('consumptive')

C:\Python27\lib\site-packages\nltk\text.pyc in index(self, word)
    369         Find the index of the first occurrence of the word in the text.
    370         """
--> 371         return self.tokens.index(word)
    372
    373     def readability(self, method):

ValueError: 'consumptive' is not in list

However, because the lookup is case sensitive, 'consumptive' is not the same as 'Consumptive', so finding its index will not return 15. Instead it raises a ValueError, because there are no lowercase 'consumptive' tokens in the text!
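
To avoid the ValueError, we can check for the token before asking for its index, using the same '.tokens' list that appears in the traceback. A small sketch:

# guard against missing tokens before calling .index()
word = 'consumptive'
if word in text1.tokens:
    print(text1.index(word))
else:
    print(word + ' does not occur in text1')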


Slicing Lists

With slicing, we can take sections of elements from the text lists by giving start and stop index points. The stop index is a limit: the element at that index is not included. So a slice from index 10 to 20 returns the elements at indexes 10 through 19, starting at 10 and going up to, but not including, 20. Below we take a slice of Monty Python and the Holy Grail, "text6", with a starting index of 2301 up to 2320.

Code:
# slicing

In [14]: text6
Out[14]: <Text: Monty Python and the Holy Grail>

In [15]: text6[2301:2320]
Out[15]:
[u'for',
 u'no',
 u'man',
 u'.',
 u'ARTHUR',
 u':',
 u'So',
 u'be',
 u'it',
 u'!',
 u'ARTHUR',
 u'and',
 u'BLACK',
 u'KNIGHT',
 u':',
 u'Aaah',
 u'!,',
 u'hiyaah',
 u'!,']

# index starts at zero
 
 In [16]: sent6
Out[16]:
['SCENE',
 '1',
 ':',
 '[',
 'wind',
 ']',
 '[',
 'clop',
 'clop',
 'clop',
 ']',
 'KING',
 'ARTHUR',
 ':',
 'Whoa',
 'there',
 '!']

In [17]: sent6[0]
Out[17]: 'SCENE'

In [18]: sent6[1]
Out[18]: '1'

Reading the output list, we notice this is where King Arthur encounters and duels the Black Knight:




Also, the first sentence in Monty Python introduces the (in)famous coconut horses and King Arthur.


More Slicing

There is more than one way to slice bread, and likewise there is more than one way to slice a list. For example, Python allows us to use negative indexes. Negative, you might ask? It sounds counter-intuitive, but it is easy to reorient yourself: negative indexes start from the end of the list. While the first element is index zero, going negative wraps around to the other end, the last element. So negative indexes run from -1 (the last element) to -n, where n is the length of the list and -n is the first element (index 0).

Using the same Monty Python sentence as before, we find the last index by subtracting 1 from the length of the list. To confirm that index -1 is indeed the last element, we compare it to our "!" result, and yes, they are the same. Index -2 gives us "there", the second to last element, as expected.

Code:
# slicing indexes - last and negative values

In [21]: len(sent6)
Out[21]: 17

In [22]: sent6[16]
Out[22]: '!'

In [23]: sent6[-1]
Out[23]: '!'

In [24]: sent6[-2]
Out[24]: 'there'

# slicing negative values

In [26]: sent6[-2:]
Out[26]: ['there', '!']

In [27]: sent6[:-2]
Out[27]:
['SCENE',
 '1',
 ':',
 '[',
 'wind',
 ']',
 '[',
 'clop',
 'clop',
 'clop',
 ']',
 'KING',
 'ARTHUR',
 ':',
 'Whoa']

What if we take a slice using a negative starting point? For example, if we choose -2 as our starting index, what will Python return? Take a look at the code above: Python returns everything from the second to last element through the end of the list. And what about a negative index for the ending point? Just what you guessed: it returns the elements from the beginning of the list, all the way up to, but not including, the second to last element, "there".



Replacing Elements


We can assign specific values to elements in lists using the "=" sign. Remember the concatenated list of sentences 2 and 3? We will use that new list and replace the first element, "The", with another element, 'polar bears'. Using index zero for the first element, we assign it 'polar bears' with an equals sign, and the list will reflect the change.

Code:
 # replacement
 
 In [39]: new
Out[39]:
['The',
 'family',
 'of',
 'Dashwood',
 'had',
 'long',
 'been',
 'settled',
 'in',
 'Sussex',
 '.',
 'In',
 'the',
 'beginning',
 'God',
 'created',
 'the',
 'heaven',
 'and',
 'the',
 'earth',
 '.']

In [40]: new[0] = 'polar bears'

In [41]: new
Out[41]:
['polar bears',
 'family',
 'of',
 'Dashwood',
 'had',
 'long',
 'been',
 'settled',
 'in',
 'Sussex',
 '.',
 'In',
 'the',
 'beginning',
 'God',
 'created',
 'the',
 'heaven',
 'and',
 'the',
 'earth',
 '.']
 
 In [43]: new[1:4] = ['do not','live in','antarctica']

In [44]: new
Out[44]:
['polar bears',
 'do not',
 'live in',
 'antarctica',
 'had',
 'long',
 'been',
 'settled',
 'in',
 'Sussex',
 '.',
 'In',
 'the',
 'beginning',
 'God',
 'created',
 'the',
 'heaven',
 'and',
 'the',
 'earth',
 '.']

Additionally, we can replace multiple entries at the same time by assigning the desired new list to the slice which we want to replace. For example, if we want to replace the second through fourth elements in the list, we can assign them new elements from a different list. Then we can view the modified list, which now says that polar bears do not live in Antarctica.
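
Note that the replacement list does not have to be the same length as the slice it replaces; the list simply grows or shrinks to fit. A quick sketch with a throwaway list:

letters = ['a', 'b', 'c', 'd']
letters[1:3] = ['x']            # replace two elements with one
print(letters)                  # ['a', 'x', 'd'] - the list is now shorter
letters[1:2] = ['p', 'q', 'r']  # replace one element with three
print(letters)                  # ['a', 'p', 'q', 'r', 'd'] - and now longer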


OK folks, here we explored how to manipulate lists, target certain elements of a list with indexes, and replace elements. Stay tuned for more on natural language processing with Python.



Thanks for reading,

Wayne
@beyondvalence
LinkedIn

Text Analysis Series:
1. Natural Language Processing in Python: Part 1. Texts
2. Natural Language Processing in Python: Part 2. Counting Vocabulary
3. Natural Language Processing in Python: Part 3. Indexing Lists
4. Natural Language Processing in Python: Part 4. Frequency Distributions, Word Selections, & Collocations

Wednesday, August 27, 2014

Natural Language Processing in Python: Part 2. Counting Vocabulary


Hello Readers,

Welcome to Part 2 of the Text Analysis with Python series. Here we will quantify words and tokens in texts with the nltk module in IPython. For more on getting started with nltk, click here.


Loading Texts

We start by loading the texts we will be using from nltk.book. Remember to start IPython with "ipython --matplotlib" from the command line. After we import the texts, we can see the titles in the output.

Code:
Microsoft Windows [Version 6.1.7601]
Copyright (c) 2009 Microsoft Corporation.  All rights reserved.

C:\Users\wayne>ipython --matplotlib
Python 2.7.8 (default, Jun 30 2014, 16:03:49) [MSC v.1500 32 bit (Intel)]
Type "copyright", "credits" or "license" for more information.

IPython 2.1.0 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.
Using matplotlib backend: Qt4Agg

In [1]: from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908

In [2]: text3
Out[2]: <Text: The Book of Genesis>

We will take a closer look at the third text, the Book of Genesis.



Counting

First off, we get the length of the text with "len()". It turns out that there are 44,764 tokens, or sequences of characters, in text3. Of course some words are used more than once. To obtain the unique tokens, we use "set()", and we nest it in "sorted()" to sort alphabetically. Looking at the first 20, we first observe punctuation, then our first word, "A", followed by "Abel". Note that after punctuation, capitalized words precede un-capitalized words in sorted order.

Code:
In [3]: len(text3)
Out[3]: 44764

# 44,764 tokens
# token == name for sequence of characters
# for unique tokens, use set()

In [6]: sorted(set(text3))[:20]
Out[6]:
[u'!',
 u"'",
 u'(',
 u')',
 u',',
 u',)',
 u'.',
 u'.)',
 u':',
 u';',
 u';)',
 u'?',
 u'?)',
 u'A',
 u'Abel',
 u'Abelmizraim',
 u'Abidah',
 u'Abide',
 u'Abimael',
 u'Abimelech']

# punctuation and capital words precede lower case words


Distinct Count

For the number of unique tokens, we saw that "set()" would return an array of the tokens. To get a count, we just nest it in "len()".

Code:
In [7]: len(set(text3))
Out[7]: 2789

# has 2,789 distinct types
# word type =  form or spelling of word independent of its 
## specific occurrences in the text (unique)
# type = includes word types and punctuation
# which is why 2,789 is types not word types

Note that the count of types includes punctuation as well, not just word types. We arrive at 2,789 distinct types used in text3, the Book of Genesis.
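
If we want to count word types only, one common approach (a sketch, not part of the original session) is to keep just the alphabetic tokens and fold case before taking the set:

# distinct word types: alphabetic tokens only, case-folded
word_types = set(w.lower() for w in text3 if w.isalpha())
len(word_types)   # smaller than 2,789, since punctuation and case variants collapse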



Lexical Diversity

Looking at a particular text, we can measure the average use of a word by dividing the total number of tokens by the number of unique ones. This tells us how many times an average type is used. To create the measure we take "len()" of text3 and divide it by "len(set())" of text3. Note that we import division from __future__ to get floating point division, since Python 2 uses integer division by default.

For text3, we see that the average type is used roughly 16 times in the text. Using ".count()", we can count the occurrences of a particular sequence of characters. For example, "smote" occurs 5 times in text3, less than the average of 16.

Code:
# lexical richness of text

In [5]: from __future__ import division

In [6]: len(text3) / len(set(text3))
Out[6]: 16.050197203298673

In [7]: text3.count('smote')
Out[7]: 5

In [8]: 100*text4.count('a') / len(text4)
Out[8]: 1.4643016433938312

In [9]: texts()
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908

In [10]: text5.count('lol')
Out[10]: 704

In [11]: text5.count('lol')/len(text5)*100
Out[11]: 1.5640968673628082

In [12]: len(text5)/len(set(text5))
Out[12]: 7.420046158918563

When we take the count, divide it by the total number of tokens, and multiply by 100, we get the percentage of the text made up by that particular word. So an often used word such as 'a' in the fourth text, the Inaugural Address Corpus, makes up about 1.46% of the text.


The text and internet shorthand 'lol', for 'laugh out loud', occurs 704 times, about 1.56% of text5, the Chat Corpus. The diversity of words there is quite high, however, as the average word is used only about 7.4 times, compared to roughly 16 repeats in text3, the Book of Genesis.
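
To avoid retyping these ratios, we can wrap them in two small helper functions, similar to those in the NLTK book; the names 'lexical_diversity' and 'percentage' are just conventions:

from __future__ import division

def lexical_diversity(text):
    # average number of times each distinct type is used
    return len(text) / len(set(text))

def percentage(count, total):
    # share of the text taken up by 'count' occurrences, in percent
    return 100 * count / total

lexical_diversity(text3)                     # ~16.05
percentage(text5.count('lol'), len(text5))   # ~1.56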

So here we covered how to count words, determine the length of a text, find the number of distinct types, and measure word diversity in Python. Stay tuned for more posts!



Thanks for reading,

Wayne
@beyondvalence
LinkedIn

Text Analysis Series:
1. Natural Language Processing in Python: Part 1. Texts
2. Natural Language Processing in Python: Part 2. Counting Vocabulary
3. Natural Language Processing in Python: Part 3. Indexing Lists
4. Natural Language Processing in Python: Part 4. Frequency Distributions, Word Selections, & Collocations

Thursday, August 21, 2014

Natural Language Processing in Python: Part 1. Texts


Hello Readers,

Here we begin exploring Natural Language Processing in Python using the nltk module. Previously, I posted a Text Mining blog series, specifically with Twitter data. With Python we progress one step further into Text Analysis: language processing.

Let us get started. Start up Python (I am using 2.7), and install the nltk module with "pip install nltk" from the command line or use the installer from their website.


Starting nltk

nltk provides a number of lexical resources and corpora (structured collections of text), along with text processing libraries. You first need to download the text sources using "import nltk" and "nltk.download()", after which a download box should appear. Choose the destination folder, select the book option under Collections, and click download.


Figure 1. nltk Download
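
If you prefer to skip the download window, the same step can be done non-interactively (an alternative, not shown in the figures) by naming the collection directly:

import nltk
nltk.download('book')   # fetches the 'book' collection used in this series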

To check the installation, "from nltk.book import *" would import all the books in the collection. In your command prompt, it might look similar to this:



Figure 2. IPython in the Command Prompt with nltk Books 

Note that it might take more than a few seconds for the large texts to load. Below (and above in Figure 2), you can view the names of the nine loaded texts, and you can print them again with "texts()". To gather information about a specific text, just type its name, such as "text1", which turns out to be Moby Dick, written by Herman Melville in 1851.

Code:
In [2]: from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908

In [3]: text1
Out[3]: <Text: Moby Dick by Herman Melville 1851>

Looking through the texts, there is considerable variety among them. From Jane Austen to the Book of Genesis, nltk provides quite a selection: the Inaugural Address Corpus, the Chat Corpus, which covers uncensored chat room dialogue, and text from the Wall Street Journal. Typing the text name, such as "text1", will give more information about the text.



Searching Text

Now that we have access to the texts, we can begin exploring their contents. One useful method of the Text object is ".concordance()". It prints the context of a particular word wherever it occurs in the text. This is an excellent way to understand how the author uses a specific word. For example, we search for the word "monstrous" in the first text, Moby Dick, below. The method returns 11 matches for "monstrous" and also shows the surrounding words where it occurs in the text.

Code:
In [4]: text1.concordance('monstrous')
Building index...
Displaying 11 of 11 matches:
ong the former , one was of a most monstrous size . ... This came towards us ,
ON OF THE PSALMS . " Touching that monstrous bulk of the whale or ork we have r
ll over with a heathenish array of monstrous clubs and spears . Some were thick
d as you gazed , and wondered what monstrous cannibal and savage could ever hav
that has survived the flood ; most monstrous and most mountainous ! That Himmal
they might scout at Moby Dick as a monstrous fable , or still worse and more de
th of Radney .'" CHAPTER 55 Of the Monstrous Pictures of Whales . I shall ere l
ing Scenes . In connexion with the monstrous pictures of whales , I am strongly
ere to enter upon those still more monstrous stories of them which are to be fo
ght have been rummaged out of this monstrous cabinet there is no telling . But
of Whale - Bones ; for Whales of a monstrous size are oftentimes cast up dead u

In [5]: text2
Out[5]: <Text: Sense and Sensibility by Jane Austen 1811>

In [6]: text2.concordance('affection')
Building index...
Displaying 25 of 79 matches:
, however , and , as a mark of his affection for the three girls , he left them
t . It was very well known that no affection was ever supposed to exist between
deration of politeness or maternal affection on the side of the former , the tw
d the suspicion -- the hope of his affection for me may warrant , without impru
hich forbade the indulgence of his affection . She knew that his mother neither
rd she gave one with still greater affection . Though her late conversation wit
 can never hope to feel or inspire affection again , and if her home be uncomfo
m of the sense , elegance , mutual affection , and domestic comfort of the fami
, and which recommended him to her affection beyond every thing else . His soci
ween the parties might forward the affection of Mr . Willoughby , an equally st
 the most pointed assurance of her affection . Elinor could not be surprised at
he natural consequence of a strong affection in a young and ardent mind . This
 opinion . But by an appeal to her affection for her mother , by representing t
 every alteration of a place which affection had established as perfect with hi
e will always have one claim of my affection , which no other can possibly shar
f the evening declared at once his affection and happiness . " Shall we see you
ause he took leave of us with less affection than his usual behaviour has shewn
ness ." " I want no proof of their affection ," said Elinor ; " but of their en
onths , without telling her of his affection ;-- that they should part without
ould be the natural result of your affection for her . She used to be all unres
distinguished Elinor by no mark of affection . Marianne saw and listened with i
th no inclination for expense , no affection for strangers , no profession , an
till distinguished her by the same affection which once she had felt no doubt o
al of her confidence in Edward ' s affection , to the remembrance of every mark
 was made ? Had he never owned his affection to yourself ?" " Oh , no ; but if

Additionally, we can search the second text, Jane Austen's Sense and Sensibility, for the word "affection". ".concordance()" returns the first 25 of the 79 total matches in the text, while showing the word context.


What about words that appear in similar contexts to a given word? We can use ".similar()", which shows us words that occur in the same contexts as our query word. For 'monstrous' in text1, some words used in the same contexts are "imperial", "subtly", "impalpable", etc. In Moby Dick, Melville uses 'monstrous' quite differently from the way Austen uses it in Sense and Sensibility.


Code:

In [9]: text1.similar('monstrous')
Building word-context index...
imperial subtly impalpable pitiable curious abundant perilous
trustworthy untoward singular lamentable few determined maddens
horrible tyrannical lazy mystifying christian exasperate

In [10]: text2.similar('monstrous')
Building word-context index...
a great good very amazingly exceedingly as so sweet heartily
remarkably extremely vast

In [12]: text2.common_contexts(['monstrous','very'])
a_pretty is_pretty a_lucky am_glad be_glad

Austen uses 'monstrous' in the company of words such as 'great', 'good', 'amazingly', and 'sweet'. So while Melville writes 'monstrous' with sometimes a positive and sometimes a negative connotation, Austen associates 'monstrous' with a positive connotation in Sense and Sensibility. You can also see the shared contexts of two words with the ".common_contexts()" method. The words 'monstrous' and 'very' share contexts involving 'pretty', 'lucky', and 'glad', all positive adjectives.



Inaugural Address Corpus

Now we move on to the large corpus composed of Presidential inaugural addresses. Looking at a keyword such as 'freedom', ".concordance()" prints the first 25 of 183 matches. Many of the early mentions have the structure "freedom of ___", where the blank is religion, the press, and so on.

Code:
In [14]: text4
Out[14]: <Text: Inaugural Address Corpus>

In [15]: text4.concordance('freedom')
Building index...
Displaying 25 of 183 matches:
s at the bar of the public reason ; freedom of religion ; freedom of the press
blic reason ; freedom of religion ; freedom of the press , and freedom of perso
ligion ; freedom of the press , and freedom of person under the protection of t
e instrumental to the happiness and freedom of all . Relying , then , on the pa
s of an institution so important to freedom and science are deeply to be regret
 be fairly and fully made , whether freedom of discussion , unaided by power ,
te and personal rights , and of the freedom of the press ; to observe economy i
rdinary lot of humanity secured the freedom and happiness of this people . We n
s inseparable from the enjoyment of freedom , but which have more than once app
 the abuse of power consists in the freedom , the purity , and the frequency of
ation to the civil power ; that the freedom of the press and of religious opini
 own ; to cherish the principles of freedom and of equal rights wherever they w
l Government into conflict with the freedom of elections , and the counteractio
f appointment more independence and freedom from such influences might be expec
e it might exert in controlling the freedom of the elective franchise . If such
of the Executive in controlling the freedom of the elective franchise through t
 from the mother country that " the freedom of the press is the great bulwark o
the people as the genuine spirit of freedom , and , like the false Christs whos
by the gifts of civil and religious freedom , who watched over and prospered th
 , which have been the offspring of freedom , and not of power . This most admi
ween church and state , and perfect freedom of opinion is guaranteed to all sec
o circumscribed the limits of human freedom and planted their standard , where
encouragement and potent appeal for freedom will be its own history -- its tria
ur sympathies in the cause of human freedom and universal advancement . But the
this Constitution the boundaries of freedom have been enlarged , the foundation

In [16]: text4.similar('freedom')
Building word-context index...
aided practicability dissolution all obstruction protest aegis
integrity assembled aggression votes paris boastfulness refunding
impoverishment under pride merchant lure risk

In [17]: text4.common_contexts(['freedom','terror'])
of_and the_of of_that

What words appear in contexts similar to 'freedom'? Again, with ".similar()" we see "aided", "practicability", "dissolution", "all", "obstruction", and many other terms. What about the shared contexts of 'freedom' and 'terror'? With the recent War on Terror, do any of the word choices reflect the events of the times? Not really: the shared contexts consist only of function words such as "of", "and", "the", and "that", so the two words do not appear in distinctive common contexts in the addresses.



Dispersion Plot


Since the Inaugural Address Corpus contains addresses spanning many years, we can use a dispersion plot to uncover temporal trends in word usage. With ".dispersion_plot()" we pass a list of the terms we want to visualize. Let's look at the terms 'citizens', 'democracy', 'freedom', 'duties', and 'America'.

Code:
In [19]: text4.dispersion_plot(['citizens','democracy','freedom','duties','America'])

The result is this plot:



Figure 3. Dispersion Plot, Text4


We can see that 'citizens' is the most evenly distributed, and therefore the most consistently used through the years in addresses. 'freedom' and 'America' see quick increases in usage around the 80,000-word offset: more recent addresses contain 'freedom' and 'America', while fewer recent addresses contain the word 'duties', compared to earlier addresses. This could be explained by the World Wars and the subsequent emergence of the US as a world power, when fighting for 'freedom' and 'America' became relevant political keywords. Some words come into favor while others, such as 'duties', fall out of favor.
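
A related way to examine these temporal trends (a sketch using nltk's corpus reader rather than the text4 object) is a conditional frequency distribution that counts selected words in each address by year:

import nltk
from nltk.corpus import inaugural

# count target words in each address, keyed by the year in the file name
cfd = nltk.ConditionalFreqDist(
    (target, fileid[:4])
    for fileid in inaugural.fileids()
    for w in inaugural.words(fileid)
    for target in ['citizen', 'democracy', 'freedom', 'duties', 'america']
    if w.lower().startswith(target))

cfd.plot()   # one line per target word, counts by year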


Okay folks, you have just read a post on natural language processing using the nltk module in Python. You learned how to search for word contexts, compare contexts between words, and find words that share similar contexts. Also, when a corpus is organized by time, we can use a dispersion plot to visualize word usage over the years. Stay tuned for the next post in this series!




Thanks for reading,

Wayne
@beyondvalence
LinkedIn

Text Analysis Series:
1. Natural Language Processing in Python: Part 1. Texts
2. Natural Language Processing in Python: Part 2. Counting Vocabulary
3. Natural Language Processing in Python: Part 3. Indexing Lists
4. Natural Language Processing in Python: Part 4. Frequency Distributions, Word Selections, & Collocations