
Wednesday, August 27, 2014

Natural Language Processing in Python: Part 2. Counting Vocabulary


Hello Readers,

Welcome to Part 2 of the Text Analysis with Python series. Here we will quantify words and tokens in texts with the nltk module in IPython. For more on getting started with nltk, see Part 1 of this series, listed at the end of this post.


Loading Texts

We start by loading the texts we will be using from nltk.book. Remember to start IPython from the command line with "ipython --matplotlib". After we import the texts, the titles appear in the output.

Code:
Microsoft Windows [Version 6.1.7601]
Copyright (c) 2009 Microsoft Corporation.  All rights reserved.

C:\Users\wayne>ipython --matplotlib
Python 2.7.8 (default, Jun 30 2014, 16:03:49) [MSC v.1500 32 bit (Intel)]
Type "copyright", "credits" or "license" for more information.

IPython 2.1.0 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.
Using matplotlib backend: Qt4Agg

In [1]: from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908

In [2]: text3
Out[2]: <Text: The Book of Genesis>

We will take a closer look at the third text, the Book of Genesis.



Counting

First, we get the length of the text with "len()". It turns out that there are 44,764 tokens, or sequences of characters, in text 3. Of course, some tokens are used more than once. To obtain the unique tokens, we use "set()", and we nest it in "sorted()" to sort them alphabetically. Looking at the first 20, we first see punctuation, then our first word, "A", followed by "Abel". Note that after punctuation, capitalized words come before lowercase words in the sort order.

Code:
In [3]: len(text3)
Out[3]: 44764

# 44764 tokens
# token == name for sequence of characters
# for unique tokens, use set()

In [6]: sorted(set(text3))[:20]
Out[6]:
[u'!',
 u"'",
 u'(',
 u')',
 u',',
 u',)',
 u'.',
 u'.)',
 u':',
 u';',
 u';)',
 u'?',
 u'?)',
 u'A',
 u'Abel',
 u'Abelmizraim',
 u'Abidah',
 u'Abide',
 u'Abimael',
 u'Abimelech']

# punctuation and capital words precede lower case words
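
As a rough, self-contained illustration of the same steps without the NLTK corpus, the sketch below uses a small hand-made token list (the "tokens" list is an assumption for illustration, not the Genesis text). It shows that "sorted()" orders strings by character code, which is why punctuation comes first and capitalized tokens precede lowercase ones.

```python
# Hypothetical mini "text": a plain list of tokens standing in for text3.
tokens = ["In", "the", "beginning", "God", "created", "the", "heaven",
          "and", "the", "earth", "."]

# Total number of tokens, repeats included.
total = len(tokens)

# Unique tokens, sorted by character code:
# punctuation first, then capitalized, then lowercase.
unique_sorted = sorted(set(tokens))

print(total)
print(unique_sorted)
```

Here "the" appears three times but is listed only once in the sorted result.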


Distinct Count

For the number of unique tokens, we saw that "set()" returns the collection of distinct tokens. To get a count, we just nest it in "len()".

Code:
In [7]: len(set(text3))
Out[7]: 2789

# has 2,789 distinct types
# word type = form or spelling of a word, independent of its
#   specific occurrences in the text (unique)
# type = word types plus punctuation
# which is why 2,789 counts types, not word types

Note that this count includes punctuation tokens, so it is a count of types rather than of word types only. We arrive at 2,789 distinct types used in text 3, the Book of Genesis.
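
To make the difference between types and word types concrete, here is a minimal sketch on a hand-made token list (an assumption for illustration, not the Genesis text). Filtering with "str.isalpha()" is one simple way to drop punctuation; it is only a rough filter and would also drop tokens that mix letters with digits or apostrophes.

```python
# Hypothetical token list for illustration (not the Genesis text).
tokens = ["And", "God", "said", ",", "Let", "there", "be", "light", ":",
          "and", "there", "was", "light", "."]

# All distinct types, punctuation included.
types = set(tokens)

# Word types only: keep tokens made up entirely of letters.
word_types = set(t for t in tokens if t.isalpha())

print(len(types))       # distinct types
print(len(word_types))  # distinct word types
```

Note that "And" and "and" still count as two separate types, since the comparison is case-sensitive.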



Lexical Diversity

Looking at a particular text, we can measure how often a type is used on average by dividing the total number of tokens by the number of unique ones. To create the measure, we take the "len()" of text 3 and divide it by its "len(set())". Remember to import division from __future__ to get floating point division, since Python 2.x uses integer division when both operands are integers.
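
The effect of integer versus floating point division can be sketched with the two counts reported for text 3. This is only an illustration: it uses "//" and "float()" so that it behaves the same on Python 2 and Python 3, rather than the __future__ import used in the session.

```python
# Counts reported for text 3 in this post.
n_tokens = 44764
n_types = 2789

# Floor division discards the fractional part
# (this is what "/" does on two ints in Python 2).
floor_result = n_tokens // n_types

# Converting one operand to float forces true division in Python 2 and 3.
true_result = n_tokens / float(n_types)

print(floor_result)
print(round(true_result, 2))
```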

For text 3, we see that each type is used roughly 16 times on average. By using ".count()", we can count the occurrences of a token. For example, "smote" occurs 5 times in text 3, fewer than the average of 16.

Code:
# lexical richness of text

In [5]: from __future__ import division

In [6]: len(text3) / len(set(text3))
Out[6]: 16.050197203298673

In [7]: text3.count('smote')
Out[7]: 5

In [8]: 100*text4.count('a') / len(text4)
Out[8]: 1.4643016433938312

In [9]: texts()
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908

In [10]: text5.count('lol')
Out[10]: 704

In [11]: text5.count('lol')/len(text5)*100
Out[11]: 1.5640968673628082

In [12]: len(text5)/len(set(text5))
Out[12]: 7.420046158918563

When we take the count of a word, divide it by the total number of tokens, and multiply by 100, we get the percentage of the text made up by that word. So an often used word such as 'a' accounts for ~1.46% of the tokens in the fourth text, the Inaugural Address Corpus.
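
The two measures used above can be wrapped in small helper functions. The names "tokens_per_type" and "percentage" and the sample list are assumptions made for this sketch. The functions only rely on "len()", "set()" and ".count()", the same operations applied to the nltk texts in the session.

```python
def tokens_per_type(tokens):
    """Average number of times each distinct type occurs."""
    return len(tokens) / float(len(set(tokens)))


def percentage(word, tokens):
    """Share of all tokens that are exactly `word`, in percent."""
    return 100 * tokens.count(word) / float(len(tokens))


# Hypothetical chat-like token list for illustration.
chat = ["lol", "hi", "lol", "hi", "all", "lol", "bye", "lol"]

print(tokens_per_type(chat))    # 8 tokens / 4 types
print(percentage("lol", chat))  # 4 of 8 tokens
```

A lower value from "tokens_per_type" means less repetition, that is, a more diverse vocabulary.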


The texting and internet shorthand 'lol', for laugh out loud, occurs 704 times, or ~1.56% of the tokens in text five, the chat corpus. However, the diversity of words is quite high: the average type is used only about 7.4 times, compared to about 16 times in text 3, the Book of Genesis.

So here we covered how to count tokens, determine the length of a text, find the number of distinct types, and measure word diversity in Python. Stay tuned for more posts!



Thanks for reading,

Wayne
@beyondvalence
LinkedIn

Text Analysis Series:
1. Natural Language Processing in Python: Part 1. Texts
2. Natural Language Processing in Python: Part 2. Counting Vocabulary
3. Natural Language Processing in Python: Part 3. Indexing Lists
4. Natural Language Processing in Python: Part 4. Frequency Distributions, Word Selections, & Collocations