
Thursday, August 21, 2014

Natural Language Processing in Python: Part 1. Texts


Hello Readers,

Here we begin exploring Natural Language Processing in Python using the nltk module. Previously, I posted a Text Mining blog series, specifically with Twitter data. With Python we progress one step further into Text Analysis: language processing.

Let us get started. Start up Python (I am using 2.7), and install the nltk module with "pip install nltk" from the command line, or use the installer from the NLTK website.


Starting nltk

nltk bundles lexical resources and corpora (collections of structured text) together with text processing libraries. You first need to download the text sources: run "import nltk" and then "nltk.download()", after which a download box should appear. Choose the destination folder, select the book option in Collections, and click download.
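If the download box is inconvenient (say, on a machine without a display), the same collection can be requested by name instead. This is only a minimal sketch: the function name is my own, and it assumes the collection identifier is "book", matching the entry shown in the Collections tab.

```python
def download_book_collection():
    """Fetch the texts used by nltk.book without opening the download box."""
    # Imported inside the function so the snippet loads even before nltk is installed.
    import nltk
    # "book" is assumed to be the identifier of the book collection.
    return nltk.download("book")
```

Calling "download_book_collection()" once should be enough; after that, the texts are read from the local data folder.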


Figure 1. nltk Download

To check the installation, run "from nltk.book import *", which imports all the texts in the collection. In your command prompt, it might look similar to this:



Figure 2. IPython in the Command Prompt with nltk Books 

Note that it might take more than a few seconds for the large texts to load. Below (and above in Figure 2), you can view the names of the nine loaded texts, and you can print them again with "texts()". To gather information about a specific text, just type in its name, such as "text1", which turns out to be Moby Dick, written by Herman Melville in 1851.

Code:
In [2]: from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908

In [3]: text1
Out[3]: <Text: Moby Dick by Herman Melville 1851>

Looking through the texts, there is considerable variety among them. From Jane Austen to the Book of Genesis, nltk provides quite a selection, including the Inaugural Address Corpus, the Chat Corpus (which covers uncensored chat room dialogue), and text from the Wall Street Journal. Typing a text name, such as "text1", displays its title.



Searching Text

Now that we have access to the text, we can begin exploring the contents. One useful method of the Text object is ".concordance()". It prints every occurrence of a particular word together with its surrounding context. This is an excellent way to understand how the author uses a specific word. For example, we search for the word "monstrous" in the first text, Moby Dick, below. The method prints all 11 matches of "monstrous", along with the words around each occurrence.

Code:
In [4]: text1.concordance('monstrous')
Building index...
Displaying 11 of 11 matches:
ong the former , one was of a most monstrous size . ... This came towards us ,
ON OF THE PSALMS . " Touching that monstrous bulk of the whale or ork we have r
ll over with a heathenish array of monstrous clubs and spears . Some were thick
d as you gazed , and wondered what monstrous cannibal and savage could ever hav
that has survived the flood ; most monstrous and most mountainous ! That Himmal
they might scout at Moby Dick as a monstrous fable , or still worse and more de
th of Radney .'" CHAPTER 55 Of the Monstrous Pictures of Whales . I shall ere l
ing Scenes . In connexion with the monstrous pictures of whales , I am strongly
ere to enter upon those still more monstrous stories of them which are to be fo
ght have been rummaged out of this monstrous cabinet there is no telling . But
of Whale - Bones ; for Whales of a monstrous size are oftentimes cast up dead u

In [5]: text2
Out[5]: <Text: Sense and Sensibility by Jane Austen 1811>

In [6]: text2.concordance('affection')
Building index...
Displaying 25 of 79 matches:
, however , and , as a mark of his affection for the three girls , he left them
t . It was very well known that no affection was ever supposed to exist between
deration of politeness or maternal affection on the side of the former , the tw
d the suspicion -- the hope of his affection for me may warrant , without impru
hich forbade the indulgence of his affection . She knew that his mother neither
rd she gave one with still greater affection . Though her late conversation wit
 can never hope to feel or inspire affection again , and if her home be uncomfo
m of the sense , elegance , mutual affection , and domestic comfort of the fami
, and which recommended him to her affection beyond every thing else . His soci
ween the parties might forward the affection of Mr . Willoughby , an equally st
 the most pointed assurance of her affection . Elinor could not be surprised at
he natural consequence of a strong affection in a young and ardent mind . This
 opinion . But by an appeal to her affection for her mother , by representing t
 every alteration of a place which affection had established as perfect with hi
e will always have one claim of my affection , which no other can possibly shar
f the evening declared at once his affection and happiness . " Shall we see you
ause he took leave of us with less affection than his usual behaviour has shewn
ness ." " I want no proof of their affection ," said Elinor ; " but of their en
onths , without telling her of his affection ;-- that they should part without
ould be the natural result of your affection for her . She used to be all unres
distinguished Elinor by no mark of affection . Marianne saw and listened with i
th no inclination for expense , no affection for strangers , no profession , an
till distinguished her by the same affection which once she had felt no doubt o
al of her confidence in Edward ' s affection , to the remembrance of every mark
 was made ? Had he never owned his affection to yourself ?" " Oh , no ; but if

Additionally, we can search the second text, Jane Austen's Sense and Sensibility, for the word "affection". ".concordance()" prints the first 25 of the 79 matches in the text, each with its surrounding context.
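To make the idea behind ".concordance()" concrete, here is a simplified sketch of a keyword-in-context search written in plain Python. It is not nltk's implementation: the function name and the "width", "window", and "max_lines" parameters are my own assumptions, and it works on any list of tokens.

```python
def concordance_lines(tokens, word, width=30, window=10, max_lines=25):
    """Return keyword-in-context lines for `word`, matching case-insensitively."""
    target = word.lower()
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() != target:
            continue
        # Take a few tokens on each side, then trim to a fixed character width.
        left = " ".join(tokens[max(0, i - window):i])[-width:]
        right = " ".join(tokens[i + 1:i + 1 + window])[:width]
        # Right-align the left context so the keyword lines up in a column.
        lines.append(left.rjust(width) + " " + token + " " + right)
        if len(lines) == max_lines:
            break
    return lines


sample = "a most monstrous size and a monstrous bulk".split()
for line in concordance_lines(sample, "monstrous", width=10):
    print(line)
```

Lining the keyword up in one column is what makes the output easy to scan, and the cap on lines mirrors the "Displaying 25 of 79 matches" behavior seen above.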


What about other words that appear in the same contexts as a certain word? For that we use ".similar()", which lists words that occur between the same neighboring words as the one we searched for. For the word 'monstrous' in text1, some words used in the same contexts are "imperial", "subtly", "impalpable", etc. In Moby Dick, Melville uses 'monstrous' quite differently from Austen in Sense and Sensibility.


Code:

In [9]: text1.similar('monstrous')
Building word-context index...
imperial subtly impalpable pitiable curious abundant perilous
trustworthy untoward singular lamentable few determined maddens
horrible tyrannical lazy mystifying christian exasperate

In [10]: text2.similar('monstrous')
Building word-context index...
a great good very amazingly exceedingly as so sweet heartily
remarkably extremely vast

In [12]: text2.common_contexts(['monstrous','very'])
a_pretty is_pretty a_lucky am_glad be_glad

Austen uses 'monstrous' in the same contexts as words such as 'great', 'good', 'amazingly', and 'sweet'. So Melville writes 'monstrous' with sometimes a positive and sometimes a negative connotation, whereas Austen associates 'monstrous' with a positive connotation in Sense and Sensibility. You can also see the contexts shared by two or more words with the ".common_contexts()" method. Each result has the form left_right, the word before and the word after. In Sense and Sensibility, both 'monstrous' and 'very' appear directly before 'pretty', 'lucky', and 'glad', all positive adjectives.
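The logic behind ".similar()" and ".common_contexts()" can be sketched in plain Python. This is a simplified illustration, not nltk's code: the function names are my own, a "context" is assumed to be the (previous word, next word) pair, the first and last tokens are skipped, and words are ranked simply by how many distinct contexts they share (nltk's actual scoring may differ).

```python
from collections import Counter, defaultdict


def build_context_index(tokens):
    """Map each lowercased word to a Counter of its (previous, next) contexts."""
    words = [t.lower() for t in tokens]
    index = defaultdict(Counter)
    for i in range(1, len(words) - 1):
        index[words[i]][(words[i - 1], words[i + 1])] += 1
    return index


def similar_words(index, word, num=20):
    """Rank other words by the number of distinct contexts shared with `word`."""
    target = word.lower()
    contexts = set(index.get(target, ()))
    scores = Counter()
    for other, other_contexts in index.items():
        if other == target:
            continue
        shared = contexts & set(other_contexts)
        if shared:
            scores[other] = len(shared)
    return [w for w, _ in scores.most_common(num)]


def common_contexts(index, words):
    """Return the contexts shared by every word in `words` as 'left_right' strings."""
    sets = [set(index.get(w.lower(), ())) for w in words]
    shared = set.intersection(*sets) if sets else set()
    return sorted("{}_{}".format(left, right) for left, right in shared)


toy = ("a monstrous pretty girl and a very pretty house "
       "is very glad is monstrous glad").split()
idx = build_context_index(toy)
print(similar_words(idx, "monstrous"))
print(common_contexts(idx, ["monstrous", "very"]))
```

In the toy sentence, 'monstrous' and 'very' both sit between "a ... pretty" and "is ... glad", which is the same kind of evidence the Austen output above is built from.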



Inaugural Address Corpus

Now we move on to the large corpus composed of Presidential inaugural addresses. Looking at a keyword such as 'freedom', ".concordance()" prints the first 25 of 183 matches. Many of the early mentions have the structure "freedom of ___", where the blank is religion, the press, etc.

Code:
In [14]: text4
Out[14]: <Text: Inaugural Address Corpus>

In [15]: text4.concordance('freedom')
Building index...
Displaying 25 of 183 matches:
s at the bar of the public reason ; freedom of religion ; freedom of the press
blic reason ; freedom of religion ; freedom of the press , and freedom of perso
ligion ; freedom of the press , and freedom of person under the protection of t
e instrumental to the happiness and freedom of all . Relying , then , on the pa
s of an institution so important to freedom and science are deeply to be regret
 be fairly and fully made , whether freedom of discussion , unaided by power ,
te and personal rights , and of the freedom of the press ; to observe economy i
rdinary lot of humanity secured the freedom and happiness of this people . We n
s inseparable from the enjoyment of freedom , but which have more than once app
 the abuse of power consists in the freedom , the purity , and the frequency of
ation to the civil power ; that the freedom of the press and of religious opini
 own ; to cherish the principles of freedom and of equal rights wherever they w
l Government into conflict with the freedom of elections , and the counteractio
f appointment more independence and freedom from such influences might be expec
e it might exert in controlling the freedom of the elective franchise . If such
of the Executive in controlling the freedom of the elective franchise through t
 from the mother country that " the freedom of the press is the great bulwark o
the people as the genuine spirit of freedom , and , like the false Christs whos
by the gifts of civil and religious freedom , who watched over and prospered th
 , which have been the offspring of freedom , and not of power . This most admi
ween church and state , and perfect freedom of opinion is guaranteed to all sec
o circumscribed the limits of human freedom and planted their standard , where
encouragement and potent appeal for freedom will be its own history -- its tria
ur sympathies in the cause of human freedom and universal advancement . But the
this Constitution the boundaries of freedom have been enlarged , the foundation

In [16]: text4.similar('freedom')
Building word-context index...
aided practicability dissolution all obstruction protest aegis
integrity assembled aggression votes paris boastfulness refunding
impoverishment under pride merchant lure risk

In [17]: text4.common_contexts(['freedom','terror'])
of_and the_of of_that

Which words share contexts with 'freedom'? Again, with ".similar()" we see "aided", "practicability", "dissolution", "all", "obstruction", and many other terms. What about the contexts shared by 'freedom' and 'terror'? With the recent War on Terror, do any of the word choices reflect the events of the times? Not really: the shared contexts consist only of common function words ("of", "and", "the", and "that"), so the two words do not appear to be used in meaningfully similar ways in the addresses.



Dispersion Plot


Since the Inaugural Address Corpus contains addresses in chronological order, we can use a dispersion plot to uncover any temporal trends in word usage. With ".dispersion_plot()" we pass a list of the terms we want to visualize. Let's look at the terms 'citizens', 'democracy', 'freedom', 'duties', and 'America'.

Code:
In [19]: text4.dispersion_plot(['citizens','democracy','freedom','duties','America'])

The result is this plot:



Figure 3. Dispersion Plot, Text4


We can see that 'citizens' is the most evenly distributed, so it has been used consistently through the years in addresses. 'freedom' and 'America' see quick increases in usage around the 80,000 word offset. In recent years, more addresses contain 'freedom' and 'America', while fewer addresses contain the word 'duties', when compared to earlier addresses. This could be explained by the World Wars and the subsequent emergence of the US as a world power, when 'freedom' and 'America' became relevant political keywords. We can see that some words come into favor while others, such as 'duties', fall from favor.
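What a dispersion plot draws is just the position (word offset) of every occurrence of each term. Here is a rough text-only sketch of that idea in plain Python. The function names and the "columns" parameter are my own, matching is assumed to be case-sensitive, and it prints rows of characters instead of drawing a figure.

```python
def word_offsets(tokens, targets):
    """Return {target: [token positions]} for each target word (case-sensitive)."""
    positions = {target: [] for target in targets}
    for offset, token in enumerate(tokens):
        if token in positions:
            positions[token].append(offset)
    return positions


def text_dispersion(tokens, targets, columns=60):
    """Build one row per target, with '|' in the columns where the word occurs."""
    offsets = word_offsets(tokens, targets)
    label_width = max([len(t) for t in targets] + [0])
    rows = []
    for target in targets:
        row = [" "] * columns
        for offset in offsets[target]:
            # Scale the token offset down to a column index.
            row[offset * columns // len(tokens)] = "|"
        rows.append(target.rjust(label_width) + " " + "".join(row))
    return rows


toy = "citizens of America value freedom and America values citizens".split()
for row in text_dispersion(toy, ["citizens", "America"], columns=9):
    print(row)
```

Because the addresses are stored in chronological order, a larger offset means a later address, which is why clusters of marks toward the right of Figure 3 point to more recent usage.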


Okay folks, you have just read a post on natural language processing using the nltk module in Python. You learned how to search for word contexts, compare the contexts, and find words that appear in similar contexts. Also, when the text is organized by time, we can use the dispersion plot to visualize word frequency and usage over time. Stay tuned for the next post in this series!




Thanks for reading,

Wayne
@beyondvalence
LinkedIn

Text Analysis Series:
1. Natural Language Processing in Python: Part 1. Texts
2. Natural Language Processing in Python: Part 2. Counting Vocabulary
3. Natural Language Processing in Python: Part 3. Indexing Lists
4. Natural Language Processing in Python: Part 4. Frequency Distributions, Word Selections, & Collocations
