Friday, September 5, 2014

Natural Language Processing in Python: Part 3. Indexing Lists


Hello Readers,

Here we continue the Text Analysis in Python blog series by examining list manipulation. Because we work with large amounts of text, knowing how to manipulate lists in Python makes us more efficient. Working with lists is crucial to natural language processing. As usual, we will use the nltk module and IPython. Keep an eye out for Monty Python.

Start IPython from the command line and let us begin.


Sentences as Lists

Previously, when we worked with nltk, we used the texts and the terms inside them. However, nltk also provides the first sentence from each of the 9 texts as variables (sent1, sent2, ..., sent9). Each sentence is a Python list, written inside square brackets [ ]. The elements of a list can be strings, numbers, or even other lists.

So by typing "sent" followed by the sentence number, we can print the tokens in that list.

Code:
C:\Users\wayne>ipython --matplotlib
Python 2.7.8 (default, Jun 30 2014, 16:03:49) [MSC v.1500 32 bit (Intel)]
Type "copyright", "credits" or "license" for more information.

IPython 2.1.0 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.
Using matplotlib backend: Qt4Agg

In [1]: from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908

In [29]: sents()
sent1: Call me Ishmael .
sent2: The family of Dashwood had long been settled in Sussex .
sent3: In the beginning God created the heaven and the earth .
sent4: Fellow - Citizens of the Senate and of the House of Representatives :
sent5: I have a problem with people PMing me to lol JOIN
sent6: SCENE 1 : [ wind ] [ clop clop clop ] KING ARTHUR : Whoa there !
sent7: Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 .
sent8: 25 SEXY MALE , seeks attrac older single lady , for discreet encounters .
sent9: THE suburb of Saffron Park lay on the sunset side of London , as red and ragged as a cloud of sunset .

In [2]: sent1
Out[2]: ['Call', 'me', 'Ishmael', '.']
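
As a side note on what a list can hold, here is a minimal sketch that runs without nltk. The names "tokens" and "mixed" are made up for this example, and "tokens" is just a hand-typed copy of sent1.

```python
# A hand-typed copy of sent1, so this runs without nltk (an assumption
# made only to keep the example self-contained).
tokens = ['Call', 'me', 'Ishmael', '.']

# A list may mix strings, numbers, and even other lists.
mixed = ['Ishmael', 1851, 3.5, tokens]

print(type(tokens))   # the built-in list type
print(len(tokens))    # 4 tokens, punctuation included
print(len(mixed))     # 4 elements; the nested list counts as one
print(mixed[3][0])    # first token of the nested list
```

The nested list counts as a single element of "mixed", which is why its length is 4 and not 7.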


Modifying Lists

We can modify a list by appending an element, which adds it to the end of the list. Using ".append()" we can change the list to our liking. We can also use an operation called concatenation, where we join two lists together with a "+" sign. It works like addition for lists and strings.

Code:

# appending

In [3]: sent1.append('hello there')

In [4]: sent1
Out[4]: ['Call', 'me', 'Ishmael', '.', 'hello there']

In [5]: sent1[::-1]
Out[5]: ['hello there', '.', 'Ishmael', 'me', 'Call']

# concatenation 

In [19]: new = sent2+sent3

In [20]: new
Out[20]:
['The',
 'family',
 'of',
 'Dashwood',
 'had',
 'long',
 'been',
 'settled',
 'in',
 'Sussex',
 '.',
 'In',
 'the',
 'beginning',
 'God',
 'created',
 'the',
 'heaven',
 'and',
 'the',
 'earth',
 '.']

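As we can see, appending 'hello there' simply adds it to the end of "sent1". Calling "sent1" again, we see the extra element at the end: ".append()" changes the list in place, so the change persists for the rest of the session. The slice "sent1[::-1]" returns a reversed copy and leaves "sent1" itself unchanged. With concatenation, we join "sent2" and "sent3" with "+" and end up with a new list holding both first sentences, one after the other.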
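
The difference between appending and concatenating can be sketched without nltk. This is only an illustration: "sent", "alias" and "joined" are made-up names, and the sentence is typed in by hand.

```python
sent = ['Call', 'me', 'Ishmael', '.']   # hand-typed copy of sent1
alias = sent                             # a second name for the SAME list

sent.append('hello there')               # changes the list in place
print(alias)                             # the alias shows the new element too

joined = sent + ['The', 'end']           # "+" builds a brand-new list
print(len(sent))                         # still 5: "+" did not touch sent
print(len(joined))                       # 7 elements in the new list
```

Because ".append()" works in place, every name that points at the list sees the change, while "+" leaves both of its inputs alone.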



List Indexes

Why do Python indexes start at zero?

Remember that in Python, list indexes start at zero, and the last element has index n-1, where n is the length of the list. It is not because computer scientists want to confuse people. One way to picture it is that an index is an offset from the start of the list. Think of it this way: when Python first accesses the list, the first element is 'read'. To read the next (second) element, Python has to move once, and to read the third element, Python has to move again. To reach the third element Python has moved twice, so the index of the third element is 2. So the pointer starts at the first element (index 0), and each index represents the number of steps Python must move from there. For example, to reach the fifth element, Python must move 4 times.

We can use this concept to find the index of a certain token, or to retrieve the element at a certain index. At index 15 of "text1" is the 16th token (token, not word, because tokens include punctuation): the word 'Consumptive', meaning someone likely suffering from tuberculosis. Note that ".index('Consumptive')" finds the first occurrence of the word and returns its position.
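
Here is a small sketch of the "index as number of moves" idea, using a hand-typed copy of sent1 so it runs without nltk. The variable names are made up for the example.

```python
sent = ['Call', 'me', 'Ishmael', '.']   # hand-typed copy of sent1

# enumerate() pairs each element with its index, starting at 0.
for index, token in enumerate(sent):
    print('%d move(s) from the start: %s' % (index, token))

# The last index is always one less than the length.
last_index = len(sent) - 1
print(sent[last_index])
```

The first token needs zero moves, so it sits at index 0, and the fourth token sits at index 3.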

Code:
In [11]: text1[15]
Out[11]: u'Consumptive'

In [12]: text1.index('Consumptive')
Out[12]: 15

In [13]: text1.index('consumptive')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-13-e45cff026a60> in <module>()
----> 1 text1.index('consumptive')

C:\Python27\lib\site-packages\nltk\text.pyc in index(self, word)
    369         Find the index of the first occurrence of the word in the text.
    370         """
--> 371         return self.tokens.index(word)
    372
    373     def readability(self, method):

ValueError: 'consumptive' is not in list

However, because the search is case sensitive, 'consumptive' is not the same as 'Consumptive', so looking up its index will not return 15. Instead it raises a ValueError because there are no lowercase 'consumptive' tokens in the text!
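
If we would rather not see a traceback, one option is to catch the ValueError ourselves. The sketch below is not part of nltk: "find_token" and "find_token_ignore_case" are hypothetical helpers, and the token list is a made-up sample, not taken from text1.

```python
def find_token(tokens, word):
    """Return the index of the first exact match, or None if absent."""
    try:
        return tokens.index(word)
    except ValueError:
        return None


def find_token_ignore_case(tokens, word):
    """Return the index of the first match, ignoring case, or None."""
    lowered = [token.lower() for token in tokens]
    return find_token(lowered, word.lower())


# Made-up sample tokens for illustration only.
sample = ['A', 'Consumptive', 'usher', 'coughed', '.']

print(find_token(sample, 'Consumptive'))              # exact match
print(find_token(sample, 'consumptive'))              # None: case differs
print(find_token_ignore_case(sample, 'consumptive'))  # match ignoring case
```

Lowercasing a copy of the tokens keeps the original list untouched, at the cost of building a second list.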


Slicing Lists

With slicing, we can take sections of elements from the text lists by giving start and stop indexes. The stop index is a limit: the element at that index is not included. So a slice from 10 to 20 returns the elements at indexes 10 through 19, starting at 10 and stopping just before 20. Below we take a slice of Monty Python and the Holy Grail in "text6", from index 2301 up to, but not including, 2320.

Code:
# slicing

In [14]: text6
Out[14]: <Text: Monty Python and the Holy Grail>

In [15]: text6[2301:2320]
Out[15]:
[u'for',
 u'no',
 u'man',
 u'.',
 u'ARTHUR',
 u':',
 u'So',
 u'be',
 u'it',
 u'!',
 u'ARTHUR',
 u'and',
 u'BLACK',
 u'KNIGHT',
 u':',
 u'Aaah',
 u'!,',
 u'hiyaah',
 u'!,']

# index starts at zero
 
In [16]: sent6
Out[16]:
['SCENE',
 '1',
 ':',
 '[',
 'wind',
 ']',
 '[',
 'clop',
 'clop',
 'clop',
 ']',
 'KING',
 'ARTHUR',
 ':',
 'Whoa',
 'there',
 '!']

In [17]: sent6[0]
Out[17]: 'SCENE'

In [18]: sent6[1]
Out[18]: '1'

Reading the output list, we notice this is where King Arthur encounters and duels the Black Knight. Also, the first sentence in Monty Python introduces the (in)famous coconut horses and King Arthur.
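
To check the "up to, but not including" rule on something small, here is a sketch using a hand-typed copy of sent6. The names "start", "stop" and "piece" are made up for the example.

```python
# Hand-typed copy of sent6, so this runs without nltk.
sent = ['SCENE', '1', ':', '[', 'wind', ']', '[', 'clop', 'clop', 'clop',
        ']', 'KING', 'ARTHUR', ':', 'Whoa', 'there', '!']

start, stop = 11, 14
piece = sent[start:stop]        # indexes 11, 12 and 13; index 14 is left out
print(piece)

# A slice always has stop - start elements (when both are within range).
print(len(piece) == stop - start)

# Because the stop index is excluded, two slices that share a boundary
# fit back together with nothing repeated or lost.
print(sent[:start] + sent[start:] == sent)
```

Excluding the stop index is what makes the last two checks come out so neatly.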


More Slicing

There is more than one way to slice bread, and likewise there is more than one way to slice a list. For example, Python allows us to use negative indexes. Negative, you might ask? It may seem counter-intuitive, but it is simple to reorient yourself: negative indexes count from the end of the list. If the first element is index zero, going negative wraps around to the other end, the last element. So negative indexes run from -1 to -n, where n is the length of the list: -1 is the last element and -n is the first element (index 0).

Using the same Monty Python sentence as before, we find the last index by subtracting 1 from the length of the list: 17 - 1 = 16. To confirm that an index of -1 is indeed the last element in the list, we compare it to the "!" we got from index 16. And yes, they are the same. Taking the second to last element with -2 gives us "there", which is the correct element.

Code:
# slicing indexes - last and negative values

In [21]: len(sent6)
Out[21]: 17

In [22]: sent6[16]
Out[22]: '!'

In [23]: sent6[-1]
Out[23]: '!'

In [24]: sent6[-2]
Out[24]: 'there'

# slicing negative values

In [26]: sent6[-2:]
Out[26]: ['there', '!']

In [27]: sent6[:-2]
Out[27]:
['SCENE',
 '1',
 ':',
 '[',
 'wind',
 ']',
 '[',
 'clop',
 'clop',
 'clop',
 ']',
 'KING',
 'ARTHUR',
 ':',
 'Whoa']

What if we take a slice using a negative starting point? For example, if we choose -2 as our starting index, what will Python return? Take a look at the code above. Python returns everything from the second to last element to the end of the list. And what about a negative index for the ending point? Just what you guessed. It returns the elements from the beginning of the list all the way up to, but not including, the second to last element, "there".
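
One way to convince ourselves how negative indexes line up with the usual ones is to check that index -k always matches index n-k. The sketch below uses a hand-typed copy of sent6; the variable names are made up for the example.

```python
# Hand-typed copy of sent6, so this runs without nltk.
sent = ['SCENE', '1', ':', '[', 'wind', ']', '[', 'clop', 'clop', 'clop',
        ']', 'KING', 'ARTHUR', ':', 'Whoa', 'there', '!']
n = len(sent)

# Index -k and index n - k point at the same element, for k = 1 .. n.
all_match = all(sent[-k] == sent[n - k] for k in range(1, n + 1))
print(all_match)

# The same translation works inside slices.
print(sent[-2:] == sent[n - 2:])    # the last two elements
print(sent[:-2] == sent[:n - 2])    # everything except the last two
print(sent[-n])                     # -n is the first element
```

So a negative index is shorthand for "length minus k", which saves us from calling len() every time we want something near the end.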



Replacing Elements


We can assign new values to elements in a list using the "=" sign. Remember the concatenated list of sentences 2 and 3? We will take that list and replace the first element, "The", with another element, "polar bears". Using index zero for the first element, we assign it 'polar bears' with an equals sign. Then our list will reflect the change.

Code:
# replacement

In [39]: new
Out[39]:
['The',
 'family',
 'of',
 'Dashwood',
 'had',
 'long',
 'been',
 'settled',
 'in',
 'Sussex',
 '.',
 'In',
 'the',
 'beginning',
 'God',
 'created',
 'the',
 'heaven',
 'and',
 'the',
 'earth',
 '.']

In [40]: new[0] = 'polar bears'

In [41]: new
Out[41]:
['polar bears',
 'family',
 'of',
 'Dashwood',
 'had',
 'long',
 'been',
 'settled',
 'in',
 'Sussex',
 '.',
 'In',
 'the',
 'beginning',
 'God',
 'created',
 'the',
 'heaven',
 'and',
 'the',
 'earth',
 '.']
 
In [43]: new[1:4] = ['do not','live in','antarctica']

In [44]: new
Out[44]:
['polar bears',
 'do not',
 'live in',
 'antarctica',
 'had',
 'long',
 'been',
 'settled',
 'in',
 'Sussex',
 '.',
 'In',
 'the',
 'beginning',
 'God',
 'created',
 'the',
 'heaven',
 'and',
 'the',
 'earth',
 '.']

Additionally, we can replace multiple entries at the same time by assigning a new list to the slice we want to replace. For example, to replace the second through fourth elements (the slice 1:4), we assign them the elements of a different list. Then we can view the modified list, which now says that polar bears do not live in Antarctica.
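
Going one step past what we did above: the replacement list does not have to be the same length as the slice it replaces. Here is a minimal sketch on a shortened, hand-typed version of the list, with a made-up final replacement ('roam') chosen only for illustration.

```python
# Shortened, hand-typed version of the concatenated list.
new = ['The', 'family', 'of', 'Dashwood', 'had', 'long']

new[0] = 'polar bears'                           # replace a single element
new[1:4] = ['do not', 'live in', 'antarctica']   # three elements for three
print(new)
print(len(new))                                  # still 6 elements

new[1:4] = ['roam']                              # three elements replaced by one
print(new)
print(len(new))                                  # the list shrank to 4
```

Assigning to a single index always swaps exactly one element, while assigning to a slice can grow or shrink the list.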


OK folks, here we explored how to manipulate lists, target certain elements of a list with indexes, and replace elements. Stay tuned for more on natural language processing with Python.



Thanks for reading,

Wayne
@beyondvalence
LinkedIn

Text Analysis Series:
1. Natural Language Processing in Python: Part 1. Texts
2. Natural Language Processing in Python: Part 2. Counting Vocabulary
3. Natural Language Processing in Python: Part 3. Indexing Lists
4. Natural Language Processing in Python: Part 4. Frequency Distributions, Word Selections, & Collocations
