Hello Readers,
We settle into Part 4 of the Text Analysis Series with Python by examining frequency distributions, word selections, and collocations. As a refresher, a collocation is a sequence of words that occur together unusually often, given the frequencies of the individual words.
A specific case is a pair of words forming a collocation as a bigram, such as 'red wine'; a trigram would be a sequence three words long. 'red' and 'wine' occur together quite often, as opposed to the generic 'the wine', while 'maroon wine' would make little sense. This demonstrates how collocations resist substitution: only those particular words carry the meaning, so those specific words are used, and they end up occurring together quite frequently.
Before we jump into collocations, let's start with frequency distributions. In the 'nltk' module in Python, we have a number of text corpora available for analysis. Load it, and get yourself comfortable for the ride.
Frequency Distributions
So we did some counting in a previous NLP post. We will count in this post as well, but with a different process. We aim to count each unique token in a given text corpus. How many times does 'government' occur in the Inaugural Address corpus? What about 'lol' in the Internet Chat corpus? Use 'FreqDist()'! As we see below, 'FreqDist()' takes the text and creates a frequency distribution over the unique tokens, and they aren't all words. They can be periods, parentheses, commas, etc.
With the summary of 'fdist1', we observe that it has 19,317 samples, or unique tokens, and 260,819 outcomes, or total tokens, which is the length of the text.
Code:
Python 2.7.8 (default, Jun 30 2014, 16:03:49) [MSC v.1500 32 bit (Intel)]
Type "copyright", "credits" or "license" for more information.

IPython 2.1.0 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.
%guiref   -> A brief reference about the graphical user interface.

In [1]: from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908

#####
# frequency distributions

# create FreqDist object
In [3]: fdist1 = FreqDist(text1)

# summary of FreqDist object, 19317 unique tokens with 260,819 total tokens
In [4]: fdist1
Out[4]: <FreqDist with 19317 samples and 260819 outcomes>

# retrieve set of tokens
In [5]: vocab1 = fdist1.keys()

# display first 10 of set
In [6]: vocab1[:10]
Out[6]:
[u'funereal',
 u'unscientific',
 u'divinely',
 u'foul',
 u'four',
 u'gag',
 u'prefix',
 u'woods',
 u'clotted',
 u'Duck']

# display number of occurrences for 'whale' token
In [7]: fdist1['whale']
Out[7]: 906

# plot first 20 terms
In [8]: fdist1.plot(20, cumulative=False)
Calling the '.keys()' method, we can assign the unique tokens to 'vocab1' and look at the first 10 elements. We see words such as 'funereal', 'unscientific', and 'divinely'. We can also look up the count of a word. Take 'whale', since text1 is Moby Dick, and we see that text1 has 906 occurrences of 'whale'.
Lastly, the 'FreqDist' object has a plot method. We specify the number of terms and whether the plot is cumulative, and Python returns Figure 1.
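To make the difference between samples and outcomes concrete without loading a corpus, here is a minimal sketch that uses 'collections.Counter' from the standard library as a stand-in for 'FreqDist' (in NLTK 3, 'FreqDist' is built on top of 'Counter'). The token list is made up for illustration and does not come from any nltk text.

```python
from collections import Counter

# Hypothetical toy token list; punctuation counts as a token too
tokens = ["the", "whale", ",", "the", "white", "whale", ",",
          "and", "the", "sea", "."]

# Counter stands in for nltk's FreqDist in this sketch
fdist = Counter(tokens)

samples = len(fdist)             # number of unique tokens
outcomes = sum(fdist.values())   # total number of tokens in the text

print(samples, outcomes)         # unique tokens vs. total tokens
print(fdist["whale"])            # count for a single token

# On Python 3, keys() returns a view, so wrap it in list() before slicing
first_three = list(fdist.keys())[:3]
print(first_three)
```

The same 'list(...)' wrapping should apply to 'fdist1.keys()' if you run the session above on Python 3, since a keys view cannot be sliced directly.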
Figure 1. Frequency Distribution of 20 Terms in Text1
Notice how 'four' has the highest count in this sample of 20 terms, with 'hanging' coming in second. The words you see with barely any count have a count of 1: they occur in the text corpus only once! These special terms are named hapaxes (singular: hapax). nltk has a special method just for identifying hapaxes. You guessed it, '.hapaxes()'!
Code:
######
# hapaxes are words that occur only once

# display first 10 occurrences
In [8]: fdist1.hapaxes()[:10]
Out[8]:
[u'funereal',
 u'unscientific',
 u'prefix',
 u'plaudits',
 u'woody',
 u'disobeying',
 u'Westers',
 u'DRYDEN',
 u'Untried',
 u'superficially']

# display total number of single occurrences
In [9]: len(fdist1.hapaxes())
Out[9]: 9002
The first 10 hapaxes can be sliced from the returned list, and we can also see how many hapaxes are in text1 by passing the result to the 'len()' function. We see that text1 has 9002 terms which occur only once.
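Under the hood, finding hapaxes amounts to filtering a frequency distribution for counts of exactly 1. The sketch below shows that idea with 'collections.Counter' and a made-up token list; it illustrates the concept and is not nltk's own implementation.

```python
from collections import Counter

# Hypothetical toy tokens, not taken from text1
tokens = ["call", "me", "Ishmael", ".", "call", "me", "later", "."]

counts = Counter(tokens)

# A hapax is any token whose count is exactly 1
hapaxes = [tok for tok, n in counts.items() if n == 1]

print(hapaxes)
print(len(hapaxes))
```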
Word Selections
Now we will take advantage of Python's ability to loop through iterable objects to select words by their length and by their number of occurrences in a corpus. This way we can search for defining words which capture the essence of a corpus, or track trends in word usage. We can adjust the word count or word length thresholds to suit different circumstances.
Word Length
We can create an if condition to accept only words with more than 15 characters in the set of unique tokens from text1, as shown below. Some long words which occur are: 'CIRCUMNAVIGATION', 'Physiognomically', and 'apprehensiveness'.
Code:
# using set theory to select words
# lengthy words of more than 15 characters
# {w | w E V & P(w)}
# [w for w in V if P(w)]
# the set of all w such that w is an element of V (vocab) and has property P

# get set of vocab in text1
In [10]: V = set(text1)

# iterate through V, grabbing each word with character length greater than 15
In [11]: long_words = [w for w in V if len(w) > 15]

# display sorted first 10 lengthy words
In [12]: sorted(long_words)[:10]
Out[12]:
[u'CIRCUMNAVIGATION',
 u'Physiognomically',
 u'apprehensiveness',
 u'cannibalistically',
 u'characteristically',
 u'circumnavigating',
 u'circumnavigation',
 u'circumnavigations',
 u'comprehensiveness',
 u'hermaphroditical']

#####
# looking at internet long word patterns
# more than 15 characters

# check which text number is internet chat, #5
In [13]: texts()
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908

# create unique vocab set
In [14]: vocab = set(text5)

# iterate through vocab for words greater than 15 characters in length
In [15]: long_chat_words = [word for word in vocab if len(word) > 15]

# display first 10 sorted
In [16]: sorted(long_chat_words)[:10]
Out[16]:
[u'!!!!!!!!!!!!!!!!',
 u'!!!!!!!!!!!!!!!!!!!!!!',
 u'!!!!!!!!!!!!!!!!!!!!!!!',
 u'!!!!!!!!!!!!!!!!!!!!!!!!!!!',
 u'!!!!!!!!!!!!!!!!!!!!!!!!!!!!',
 u'!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!',
 u'#talkcity_adults',
 u'(((((((((((((((((',
 u'((((((((((((((((((',
 u'((((((((((((((((((((']

# display 101st to 111th sorted, no results
In [17]: sorted(long_chat_words)[100:111]
Out[17]: []

# index from last for last 10
# observe exaggerated chat patterns
In [18]: sorted(long_chat_words)[-10:]
Out[18]:
[u'oooooooooooooonnnnnnnnnnnneeeeeeeeeeeeeeesssssssss',
 u'raaaaaaaaaaaaaaaaaaaaaaaaaaaaa',
 u'tatatatnanaantatat',
 u'weeeeeeeeeeeeeeee',
 u'weeeeeeeeeeeeeeeeeeeeeeeeed',
 u'wheeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee',
 u'woooooooooaaaahhhhhhhhhhhh',
 u'wooooooooooooohoooooooooooooooo',
 u'www.Wunderground.com',
 u'yuuuuuuuuuuuummmmmmmmmmmm']
Furthermore, we use the Internet Chat corpus, text5, to examine some long words. The first few are simply runs of exclamation points, while the last few are 'overspelled' for dramatic effect.
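Since punctuation runs crowd the top of the sorted list, one possible refinement (my own assumption, not something done in the session above) is to keep only alphabetic tokens with 'str.isalpha()'. Here is a small sketch on made-up chat-like tokens:

```python
# Hypothetical chat-like tokens, made up for illustration
tokens = ["hi", "w" + "o" * 17, "lol", "!" * 16, "hi", "circumnavigation"]

vocab = set(tokens)  # unique tokens only

# keep tokens strictly longer than 15 characters
long_words = sorted(w for w in vocab if len(w) > 15)

# optional refinement: drop tokens that are not purely alphabetic
long_alpha = [w for w in long_words if w.isalpha()]

print(long_words)
print(long_alpha)
```

Note that this filter would also drop tokens such as 'www.Wunderground.com', so whether it helps depends on what you want to find.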
Word Length and Frequency
Alongside word length, we could consider another attribute to select words from a corpus: word frequency. Even if a word is long, if it occurs more than a few times it could indicate an important word in the corpus. So we include the count from the frequency distribution.
Looking at the Internet Chat corpus (text5) again, we select words that are longer than 8 characters and occur more than 5 times. Keep in mind we have to create the frequency distribution from text5 first, and use the set of text5 as the iterable.
Code:
#####
# which words typify a text?
# the long words or the single occurrences? (hapaxes)
# what about frequently occurring long words?

# display text names
In [19]: texts()
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908

# create FreqDist object for text5
In [20]: fdist5 = FreqDist(text5)

# sort words iterated through set of text5
# having character length more than 8, and occurring more than 5 times
In [25]: selected_words = sorted([w for w in set(text5) if len(w) > 8 and fdist5[w] > 5])

# display words selected on minimum word length and occurrence
In [26]: selected_words
Out[26]:
[u'#14-19teens',
 u'#talkcity_adults',
 u'((((((((((',
 u')))))))))))',
 u')))))))))))))',
 u'.........',
 u'Compliments',
 u'cute.-ass',
 u'everybody',
 u'everything',
 u'listening',
 u'seriously',
 u'something',
 u'sometimes']
The results vary in content, since Internet Chat is not censored, as you might discover if you delve deeper into the text.
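The combined filter can be tried on a toy example as well. This sketch again uses 'collections.Counter' as a stand-in for 'FreqDist', with invented tokens and counts. It iterates over the distribution's keys, which are the same unique tokens that 'set(text5)' would give.

```python
from collections import Counter

# Hypothetical tokens with invented counts
tokens = (["everybody"] * 6 + ["something"] * 7 + ["extraordinary"] * 2
          + ["lol"] * 20 + ["." * 9] * 6)

fdist = Counter(tokens)

# longer than 8 characters AND occurring more than 5 times
selected = sorted(w for w in fdist if len(w) > 8 and fdist[w] > 5)

print(selected)
```

Here 'extraordinary' is long but too rare, and 'lol' is frequent but too short, so neither passes both conditions.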
Bigrams & Collocations
Here we arrive at the word pairs and special word pairs. The 'bigrams()' function creates pairings of words as it iterates through the text, combining adjacent words. Collocations are those word pairs which occur together unusually frequently, and you might find that they have a particular meaning when seen together and are not descriptive when apart.
The bigrams for 'more is said than done' are shown below. Note that the adjacent words are paired. For collocations, use the '.collocations()' method on a text corpus to retrieve the list of collocation terms. Looking at the text4 Inaugural Address corpus, 'United States', 'fellow citizens', 'four years', and such appear together more often than their individual frequencies would predict. 'Indian tribes' and 'Chief Justice' are also terms which occur together.
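The pairing of adjacent words is a sliding window over the token list, and the same idea extends to trigrams. Below is a minimal sketch with a hypothetical helper named 'sliding_ngrams' (my own name, not part of nltk) built from 'zip':

```python
def sliding_ngrams(tokens, n):
    """Return the adjacent n-token tuples of a list (hypothetical helper)."""
    # zip stops at the shortest slice, so the window never runs off the end
    return list(zip(*(tokens[i:] for i in range(n))))

words = ["more", "is", "said", "than", "done"]

pairs = sliding_ngrams(words, 2)     # bigrams
triples = sliding_ngrams(words, 3)   # trigrams

print(pairs)
print(triples)
```

A list of five words yields four bigrams and three trigrams, since each window starts one position later than the last.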
Code:
# collocation : sequence of words that occur together unusually often
# ex: 'red wine', as opposed to 'the wine'
# bigram : word pairs

# create bigram list
In [34]: bigram1 = list(bigrams(['more','is','said','than','done']))

# display bigram, note pairs of words
In [35]: bigram1
Out[35]: [('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]

# collocation for text4
# bigrams with words that occur together more frequently
# than expected based on frequency of individual words
In [37]: text4.collocations()
Building collocations list
United States; fellow citizens; four years; years ago; Federal
Government; General Government; American people; Vice President; Old
World; Almighty God; Fellow citizens; Chief Magistrate; Chief Justice;
God bless; every citizen; Indian tribes; public debt; one another;
foreign nations; political parties

# collocations for personals corpus
In [38]: texts()
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908

# display personals collocations
In [39]: text8.collocations()
Building collocations list
would like; medium build; social drinker; quiet nights; non smoker;
long term; age open; Would like; easy going; financially secure; fun
times; similar interests; Age open; weekends away; poss rship; well
presented; never married; single mum; permanent relationship; slim
build
For the personals corpus, text8, we encounter word pairs in personal advertisements online. Likely and logical word combinations such as 'medium build', 'social drinker', 'quiet nights', 'easy going', 'financially secure', and 'permanent relationship' are paired together due to the nature of personality and lifestyle description. Simply put, those words go together, like peanut butter and jelly, although I am a 'peanut butter will do' person. The pair 'peanut butter' would be a collocation in a baking recipe or snacks corpus.
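To see what 'more often than expected' can mean numerically, here is a rough sketch that scores adjacent pairs with pointwise mutual information (PMI). PMI is a simple measure I am swapping in for illustration. It is not necessarily the scoring that '.collocations()' uses internally, and the sentence is made up.

```python
import math
from collections import Counter

# Hypothetical toy text, already split into tokens
tokens = ("i like red wine and the cheese , the bread and red wine , "
          "the wine and the red wine").split()

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))   # adjacent pairs
n_uni = sum(unigrams.values())
n_bi = sum(bigrams.values())

def pmi(w1, w2):
    """PMI in bits for an adjacent pair; the pair must occur at least once."""
    p_pair = bigrams[(w1, w2)] / n_bi
    p_w1 = unigrams[w1] / n_uni
    p_w2 = unigrams[w2] / n_uni
    return math.log2(p_pair / (p_w1 * p_w2))

# 'red wine' scores higher than the generic 'the wine'
print(pmi("red", "wine") > pmi("the", "wine"))
```

One caveat: PMI rewards rare pairs heavily, so a pair seen only once can outrank a genuine collocation. For that reason a minimum count filter is usually applied before ranking.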
Wow, there goes another post! I know there was only one graph this time, so thanks for making it this far. Here we learned about frequency distributions, different ways to select words from a text corpus, and word selection tools to create bigrams and collocations in Python. This is just the beginning, and there is lots more text analysis to cover! So stay tuned.
Thanks for reading,
Wayne
@beyondvalence
Text Analysis Series:
1. Natural Language Processing in Python: Part 1. Texts
2. Natural Language Processing in Python: Part 2. Counting Vocabulary
3. Natural Language Processing in Python: Part 3. Indexing Lists
4. Natural Language Processing in Python: Part 4. Frequency Distributions, Word Selections, & Collocations