Hello Readers,
Welcome to Part 2 of the Text Analysis with Python series. Here we will quantify words and tokens in texts with the nltk module in IPython. For more on getting started with nltk, see Part 1, linked below.
Loading Texts
We start by loading the texts we will be using from nltk.book. Remember to start IPython with "ipython --matplotlib" in the command line. After we import the texts, we can see their titles in the output.

Code:
Microsoft Windows [Version 6.1.7601]
Copyright (c) 2009 Microsoft Corporation.  All rights reserved.

C:\Users\wayne>ipython --matplotlib
Python 2.7.8 (default, Jun 30 2014, 16:03:49) [MSC v.1500 32 bit (Intel)]
Type "copyright", "credits" or "license" for more information.

IPython 2.1.0 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.
Using matplotlib backend: Qt4Agg

In [1]: from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.

text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908

In [2]: text3
Out[2]: <Text: The Book of Genesis>
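If the import above fails because the book corpora are not installed yet, they can be fetched once with NLTK's downloader. A minimal setup sketch (assumes a working internet connection; run it once before importing nltk.book):

import nltk

# One-time download of the corpora behind nltk.book
# (Moby Dick, Genesis, the Inaugural Address Corpus, etc.)
nltk.download('book')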
We will take a closer look at the third text, the Book of Genesis.
Counting
First off, we measure the length of the text with "len()". It turns out that there are 44,764 tokens, or sequences of characters, in text 3. Of course, some words are used more than once. To obtain the unique tokens, we use "set()", and we nest it in "sorted()" to sort alphabetically. Looking at the first 20 entries, we first observe punctuation, then our first word, "A", followed by "Abel". Note that after punctuation, capitalized words precede uncapitalized words in the sort order.

Code:
In [3]: len(text3)
Out[3]: 44764
# 44764 tokens
# token == name for a sequence of characters
# for unique tokens, use set()

In [6]: sorted(set(text3))[:20]
Out[6]:
[u'!',
 u"'",
 u'(',
 u')',
 u',',
 u',)',
 u'.',
 u'.)',
 u':',
 u';',
 u';)',
 u'?',
 u'?)',
 u'A',
 u'Abel',
 u'Abelmizraim',
 u'Abidah',
 u'Abide',
 u'Abimael',
 u'Abimelech']
# punctuation and capital words precede lower case words
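That ordering falls out of how Python compares strings: tokens sort by character code, and punctuation codes come before uppercase letters, which in turn come before lowercase letters. A minimal standalone sketch (the toy token list is made up for illustration):

# Toy token list (hypothetical) to show the sort order
tokens = ['beginning', '.', 'In', 'the', 'God', ',']

print(sorted(set(tokens)))
# [',', '.', 'God', 'In', 'beginning', 'the']
# punctuation first, then capitalized words, then lower case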
Distinct Count
For the number of unique tokens, we saw that "set()" returns the collection of distinct tokens. To get a count, we just nest it in "len()".

Code:
In [7]: len(set(text3))
Out[7]: 2789
# 2,789 distinct types
# word type = form or spelling of a word independent of its
##  specific occurrences in the text (unique)
# type = includes word types and punctuation
# which is why 2,789 counts types, not word types
Note that punctuation is included in the count, which is why we speak of types rather than word types. We arrive at 2,789 distinct types used in text 3, the Book of Genesis.
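If we wanted word types only, one approach (a sketch of ours, not from the nltk output above) is to filter out punctuation tokens with "isalpha()" and normalize case before counting:

# Count word types only: drop punctuation tokens and lowercase
# everything, so 'Abide' and 'abide' count as one type
word_types = set(w.lower() for w in text3 if w.isalpha())
print(len(word_types))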
Lexical Diversity
Looking at a particular text, we can measure how often an average word is used by dividing the total number of tokens by the number of unique types. To create the measure we take the "len()" of text 3 and divide it by its "len(set())". Note that we import division from __future__ to get floating point division, since Python 2.x uses integer division by default. For text 3, we see that the average type is used roughly 16 times in the text. Using ".count()", we can tally the occurrences of a particular sequence of characters. For example, "smote" occurs 5 times in text 3, less than the average of 16.
Code:
# lexical richness of text
In [5]: from __future__ import division

In [6]: len(text3) / len(set(text3))
Out[6]: 16.050197203298673

In [7]: text3.count('smote')
Out[7]: 5

In [8]: 100*text4.count('a') / len(text4)
Out[8]: 1.4643016433938312

In [9]: texts()
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908

In [10]: text5.count('lol')
Out[10]: 704

In [11]: text5.count('lol')/len(text5)*100
Out[11]: 1.5640968673628082

In [12]: len(text5)/len(set(text5))
Out[12]: 7.420046158918563
When we take a word's count, divide it by the total number of tokens, and multiply by 100, we get the percentage of the text made up by that word. For an often used word such as 'a' in the fourth text, the Inaugural Address Corpus, that comes to ~1.46% of the text.
The text and internet shorthand 'lol', for laugh out loud, occurs 704 times, accounting for ~1.56% of text 5, the Chat Corpus. However, the diversity of words there is quite high: the average type is used only about 7.4 times, compared to about 16 in text 3, the Book of Genesis.
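Since we will reuse these two measures, it is handy to wrap them in small functions. A minimal sketch (the names lexical_diversity and percentage are our own labels; the expected outputs come from the session above):

from __future__ import division  # Python 2.x: floating point division

def lexical_diversity(text):
    # average number of times each distinct type appears
    return len(text) / len(set(text))

def percentage(count, total):
    # share of the text a given count represents, in percent
    return 100 * count / total

print(lexical_diversity(text5))                    # ~7.42
print(percentage(text5.count('lol'), len(text5)))  # ~1.56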
So here we covered how to count words, determine the length of a text, find the number of distinct types, and measure lexical diversity in Python. Stay tuned for more posts!
Thanks for reading,
Wayne
@beyondvalence
Text Analysis Series:
1. Natural Language Processing in Python: Part 1. Texts
2. Natural Language Processing in Python: Part 2. Counting Vocabulary
3. Natural Language Processing in Python: Part 3. Indexing Lists
4. Natural Language Processing in Python: Part 4. Frequency Distributions, Word Selections, & Collocations