
Monday, September 29, 2014

Handling Large Data with SQL Server


Hello Readers,

Big Data is all the rage these days. On this blog we have mainly used R as our analysis program of choice (sorry Python) to examine, model, and predict on the data. R works well for data sets with hundreds of thousands of rows or fewer, but larger data sets with millions of rows usually slow R to a crawl. (Remember the neural network post?) Another deciding factor in computation speed is your computer setup.

Because there are millions of rows in big data sets, we need only sample a few hundred thousand of them into R. But where will we store the millions of rows prior to sampling? Easy: in a data warehouse made for million-plus-row data sets, a SQL Server database.

We have not dealt with hundreds of millions or billions of rows, so resist the urge to mention the Hadoop distributed file system. Or should I say not yet? Anyway, here we will load million-plus-row U.S. Census data sets in preparation for sampling in R.


Getting Census Data


Specifically, we concentrate on two data sets from the U.S. Census Bureau's 2011 American Community Survey (ACS), which contain nationwide population and housing data. You can download the data via ftp here, or choose 2011 ACS 1-year PUMS from the ACS site. The ACS encompasses social, economic, demographic, and housing data. A screenshot of the page is shown below:


Figure 1. American Community Survey (Census) Webpage
After clicking the red highlighted link, 2011 ACS 1-year PUMS, we choose between CSV and SAS file formats. Take your pick; I chose CSV for importing into SQL Server.


Figure 2. Choose File Format - CSV
Next, the site gives us the choice of downloading the entire dataset for the U.S. or individual states. Of course, download both the Population and Housing data for the entire U.S. Be warned, these two files take up nearly 1 GB of hard drive space, so be patient! Download accelerator, anyone?


Figure 3. Download Population and Housing Data Files
After we have the zip files downloaded, unzip them. Note that the housing data 'ss11hus*' and population data 'ss11pus*' have parts a and b.


Figure 4. Unzipped Files
You can also find the helpful README file for the data at the ACS site as well.


Importing into SQL Server


Now for the crucial part, where we import the data into a database system, SQL Server, instead of reading the CSV files into R directly. After connecting to your server, select the database into which you want to import the CSV files. Right-click it and, under Tasks, choose Import Files.

Figure 5. Import Files Option
Next we specify the data source. The CSV files are not Excel or Access files; rather, they are a Flat File Source. So select that option, and pick a CSV file from the unzipped file location. On the left, under General, you can see a Columns view option, which lets you preview the file contents prior to importing.

Figure 6. Select Flat File Source
Figure 7 shows the Columns for our preview, so we can ensure the proper headers and values are read by the SQL Server Wizard.

Figure 7. Column Previews
Even though we right-clicked a certain database, SQL Server still asks us which database we want to import the data into. You can change your mind here if you clicked the wrong one earlier, so take your pick.

Figure 8. Table Destination
Here we encounter the Source Table Option window. We can alter the schema and name of the table with Edit Mapping at the bottom. The current schema and name are set as 'dbo' and 'ss11pusa', respectively. If you want to use a different schema than the default, now is the time to make the change.

Figure 9. Select Source Tables
SQL Server gives us a last overview option before we start the importing process. Ready? Click Finish.

Figure 10. Last Check
And here we go! Look at the Import Wizard chug along as it imports over a million rows!

Figure 11. Import Progressing
When the process is finished, we are rewarded with... a new table with lots of rows. Take a look below. 1.6 million? Not bad. Now you just have to import the remaining three files for a complete set!

Figure 12. Execution Successful
Remember to refresh the server instance so it reflects the new table. Now you can query away in a new query window to explore the data, or you can keep the server running and jump into R to establish a database connection using ODBC (open database connectivity).

Figure 13. Locating Imported Table
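If you would like to preview that ODBC connection from Python rather than wait for the R post, the same idea can be sketched with the pyodbc package. The snippet below is only an outline: the driver name, server, and database in the connection string are assumptions you would replace with your own setup.

Code:
import pyodbc   # assumes the pyodbc package and a SQL Server ODBC driver are installed
import pandas as pd

# placeholder connection details -- swap in your own server and database
conn = pyodbc.connect(
    "DRIVER={SQL Server};SERVER=localhost;DATABASE=Census;Trusted_Connection=yes;"
)

# pull a modest sample of the imported table into a DataFrame
sample = pd.read_sql("SELECT TOP 100000 * FROM dbo.ss11pusa;", conn)
conn.close()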
I thought this post made up for the previous few posts' lack of pictures. Was the compensation count this time sufficient (13)? Here we learned how to import a large data set, split across four tables, into a database management system, SQL Server. The next post will feature the next step in accessing Big Data in R, the database connection. There we shall use SQL Server as a data warehouse and R as the analytics tool. Stay tuned.

Thanks for reading,

Wayne
@beyondvalence
LinkedIn

Wednesday, September 24, 2014

Natural Language Processing in Python: Part 4. Frequency Distributions, Word Selections, & Collocations


Hello Readers,

We settle into Part 4 of the Text Analysis Series with Python by examining frequency distributions, word selections, and collocations. As a refresher, collocations are sequences of words that occur together unusually often, given the frequencies of the individual words.

A specific case would be a pair of words forming a collocation in a bigram, such as 'red wine'; a trigram sequence would be three words long. 'red' and 'wine' occur together quite often, as opposed to the generic 'the wine', while 'maroon wine' would make little sense. That demonstrates how collocations are resistant to substitution: only those particular words carry that meaning, so those specific words are used, and because only those words are used, they occur together quite frequently.

Before we jump into collocations, let's start with frequency distributions. In the 'nltk' module in Python, we have a number of text corpora available for analysis. Load it, and get yourself comfortable for the ride.


Frequency Distributions


So we did some counting in a previous NLP post. We will count in this post as well, but with a different approach. We aim to quantify each unique token in a given text corpus. How many times does 'government' occur in the Inaugural Address corpus? What about 'lol' in the Internet Chat corpus? Use the 'FreqDist()' method! As we see below, 'FreqDist()' takes the text and creates a frequency distribution for the unique tokens, and they aren't all words. They can be periods, parentheses, commas, etc.


With the summary of 'fdist1', we observe that it has 19,317 samples, or unique tokens, and 260,819 outcomes, the total length of the text.

Code:
Python 2.7.8 (default, Jun 30 2014, 16:03:49) [MSC v.1500 32 bit (Intel)]
Type "copyright", "credits" or "license" for more information.

IPython 2.1.0 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.
%guiref   -> A brief reference about the graphical user interface.

In [1]: from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908

#####
# frequency distributions

# create FreqDist object
In [3]: fdist1 = FreqDist(text1)

# summary of FreqDist object, 19317 unique tokens with 260,819 total tokens
In [4]: fdist1
Out[4]: <FreqDist with 19317 samples and 260819 outcomes>

# retrieve set of tokens
In [5]: vocab1 = fdist1.keys()

# display first 10 of set
In [6]: vocab1[:10]
Out[6]:
[u'funereal',
 u'unscientific',
 u'divinely',
 u'foul',
 u'four',
 u'gag',
 u'prefix',
 u'woods',
 u'clotted',
 u'Duck']

# display number of occurrences for 'whale' token
In [7]: fdist1['whale']
Out[7]: 906

# plot first 20 terms
In [8]: fdist1.plot(20, cumulative=False)

Accessing the '.keys()' method, we can assign the unique tokens to 'vocab1' and look at the first 10 elements. We see words such as 'funereal', 'unscientific', and 'divinely'. We can also look up the count of a word. Take 'whale': since text1 is Moby Dick, we see that text1 has 906 occurrences of 'whale'.
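A small compatibility note: this session ran on Python 2 with an older NLTK. In NLTK 3 on Python 3, '.keys()' returns a view that cannot be sliced directly, so a sketch like the following is needed (assuming the same 'fdist1' object as above):

Code:
# NLTK 3 / Python 3 equivalents of the calls above
vocab1 = list(fdist1.keys())       # materialize the view before slicing
print(vocab1[:10])                 # first 10 unique tokens
print(fdist1.most_common(10))      # (token, count) pairs sorted by frequency
print(fdist1['whale'])             # count lookup works the same way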


Lastly, the 'FreqDist' object has a plot function. We specify the number of terms, and whether the plot is cumulative or not, and Python returns Figure 1.


Figure 1. Frequency Distribution of 20 Terms in Text1

Notice how 'four' has the highest count in this sample of 20 terms, with 'hanging' coming in second. The words you see with barely any count have a count of 1; they occur in the text corpus only once! These special terms are called hapaxes (singular: hapax). nltk has a special function just for identifying hapaxes. You guessed it, '.hapaxes()'!


Code:
######
# hapaxes are words that occur only once

# display first 10 occurrences
In [8]: fdist1.hapaxes()[:10]
Out[8]:
[u'funereal',
 u'unscientific',
 u'prefix',
 u'plaudits',
 u'woody',
 u'disobeying',
 u'Westers',
 u'DRYDEN',
 u'Untried',
 u'superficially']
 
# display total number of single occurrences 
In [9]: len(fdist1.hapaxes())
Out[9]: 9002

The first 10 hapaxes can be sliced from the function, and we can also see how many hapaxes are in text1 by passing the result to the 'len()' method. We see that text1 has 9,002 terms which occur only once.



Word Selections


Now we will take advantage of Python's looping through iterable objects to select words by attributes such as word length and number of occurrences in a corpus. This way we can search for defining words which capture the essence of a corpus, or track trends in word usage. We can adjust the word count or word length thresholds to our needs in different circumstances.


Word Length
We can create an if condition to accept only words with more than 15 characters in the set of unique tokens from text1, as shown below. Some long words which occur are: 'CIRCUMNAVIGATION', 'Physiognomically', and 'apprehensiveness'.


Code:
# using set theory to select words
# lengthy words of 15 characters or more
# {w | w E V & P(w)}
# [w for w in V if P(w)]
# the set of all w such that w is an element of V (vocab) and has property P

# get set of vocab in text1
In [10]: V = set(text1)

# iterate through V, grabbing each word with character length greater than 15
In [11]: long_words = [w for w in V if len(w) > 15]

# display sorted first 10 lengthy words
In [12]: sorted(long_words)[:10]
Out[12]:
[u'CIRCUMNAVIGATION',
 u'Physiognomically',
 u'apprehensiveness',
 u'cannibalistically',
 u'characteristically',
 u'circumnavigating',
 u'circumnavigation',
 u'circumnavigations',
 u'comprehensiveness',
 u'hermaphroditical']

#####
# looking at internet long word patterns
# more than 15 characters

# check which text number is internet chat, #5 
In [13]: texts()
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908

# create unique vocab set
In [14]: vocab = set(text5)

# iterate through vocab for words greater than 15 characters in length
In [15]: long_chat_words = [word for word in vocab if len(word) > 15]

# display first 10 sorted
In [16]: sorted(long_chat_words)[:10]
Out[16]:
[u'!!!!!!!!!!!!!!!!',
 u'!!!!!!!!!!!!!!!!!!!!!!',
 u'!!!!!!!!!!!!!!!!!!!!!!!',
 u'!!!!!!!!!!!!!!!!!!!!!!!!!!!',
 u'!!!!!!!!!!!!!!!!!!!!!!!!!!!!',
 u'!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!',
 u'#talkcity_adults',
 u'(((((((((((((((((',
 u'((((((((((((((((((',
 u'((((((((((((((((((((']

# display 101st to 110th sorted, no results 
In [17]: sorted(long_chat_words)[100:111]
Out[17]: []

# index from last for last 10
# observe exaggerated chat patterns
In [18]: sorted(long_chat_words)[-10:]
Out[18]:
[u'oooooooooooooonnnnnnnnnnnneeeeeeeeeeeeeeesssssssss',
 u'raaaaaaaaaaaaaaaaaaaaaaaaaaaaa',
 u'tatatatnanaantatat',
 u'weeeeeeeeeeeeeeee',
 u'weeeeeeeeeeeeeeeeeeeeeeeeed',
 u'wheeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee',
 u'woooooooooaaaahhhhhhhhhhhh',
 u'wooooooooooooohoooooooooooooooo',
 u'www.Wunderground.com',
 u'yuuuuuuuuuuuummmmmmmmmmmm']

Next, we use the Internet Chat corpus, text5, to examine some lengthy words. The first few are simply strings of exclamation points, while the last few are 'overspelled' for dramatic effect.


Word Length and Frequency
Besides word length, we can consider another attribute for selecting words from a corpus: word frequency. Even if a word is long-winded, if it occurs more than a few times it could be indicative of an important word in the corpus. So we include the count from the frequency distribution.

Looking at the Internet Chat corpus (text5) again, we select for the word length to be more than 8, and a frequency higher than 5. Keep in mind we have to create the frequency distribution from text5 first, and use the set of text5 as the iterable variable.

Code:
#####
# which words typify a text? 
# the long words or the single occurrences? (hapaxes)
# what about frequently occurring long words?

# display text names
In [19]: texts()
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908

# create FreqDist object for text5
In [20]: fdist5 = FreqDist(text5)

# sort words iterated through set of text5
# having character length more than 8, and occurring more than 5 times
In [25]: selected_words = sorted([w for w in set(text5) if len(w) > 8 and fdist5[w] > 5])

# display words selected on minimum word length and occurrence
In [26]: selected_words
Out[26]:
[u'#14-19teens',
 u'#talkcity_adults',
 u'((((((((((',
 u')))))))))))',
 u')))))))))))))',
 u'.........',
 u'Compliments',
 u'cute.-ass',
 u'everybody',
 u'everything',
 u'listening',
 u'seriously',
 u'something',
 u'sometimes']

The results vary in content, as Internet Chat is not censored, as you might discover if you delve deeper into the text.



Bigrams & Collocations


Here we arrive at word pairs and special word pairs. The 'bigrams()' method creates pairings of words as it iterates through the text, combining adjacent words. Collocations are those word pairs which occur together unusually frequently; you might find that they carry a particular meaning when seen together and are not as descriptive apart.

The bigrams for 'more is said than done' are shown below. Note that the adjacent words are paired. For collocations, use the '.collocations()' method on a text corpus to retrieve the list of collocation terms. Looking at text4, the Inaugural Address corpus, 'United States', 'fellow citizens', 'four years', and the like appear in those combinations more often than not; 'Indian tribes' and 'Chief Justice' are other terms which occur together.

Code:
# collocation : sequence of words that occur together unusually often
# ex: 'red wine', as opposed to 'the wine'
# bigram : word pairs 

# create bigram list
In [34]: bigram1 = list(bigrams(['more','is','said','than','done']))

# display bigram, note pairs of words
In [35]: bigram1
Out[35]: [('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]

# collocation for text4
# bigrams with words that occur together more frequently 
# than expected based on frequency of individual words
In [37]: text4.collocations()
Building collocations list
United States; fellow citizens; four years; years ago; Federal
Government; General Government; American people; Vice President; Old
World; Almighty God; Fellow citizens; Chief Magistrate; Chief Justice;
God bless; every citizen; Indian tribes; public debt; one another;
foreign nations; political parties

# collocations for personals corpus
In [38]: texts()
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908

# display personals collocations
In [39]: text8.collocations()
Building collocations list
would like; medium build; social drinker; quiet nights; non smoker;
long term; age open; Would like; easy going; financially secure; fun
times; similar interests; Age open; weekends away; poss rship; well
presented; never married; single mum; permanent relationship; slim
build

For the personals corpus, text8, we encounter word pairs from online personal advertisements. Likely and logical word combinations such as 'medium build', 'social drinker', 'quiet nights', 'easy going', 'financially secure', and 'permanent relationship' are paired together due to the nature of personality and lifestyle descriptions. Simply put, those words go together, like peanut butter and jelly, although I am a 'peanut butter will do' person. The pair 'peanut butter' would be a collocation in a baking recipe or snacks corpus.
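If you want collocations from your own tokens rather than one of the built-in texts, nltk also exposes the machinery behind '.collocations()'. The sketch below scores bigrams by pointwise mutual information; it is only an illustration, and the 'recipe_tokens' list is made up for the example.

Code:
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# a made-up token list standing in for a snacks/recipe corpus
recipe_tokens = ['spread', 'peanut', 'butter', 'on', 'toast', 'then',
                 'add', 'peanut', 'butter', 'to', 'the', 'smoothie',
                 'peanut', 'butter', 'pairs', 'with', 'jelly']

finder = BigramCollocationFinder.from_words(recipe_tokens)
finder.apply_freq_filter(2)                     # keep pairs seen at least twice
bigram_measures = BigramAssocMeasures()
print(finder.nbest(bigram_measures.pmi, 5))     # top pairs by pointwise mutual information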


Wow, there goes another post! I know there was only one graph this time, so thanks for making it this far. Here we learned about frequency distributions, different ways to select words from a text corpus, and how to create bigrams and collocations in Python. This is just the beginning, and there is lots more text analysis to cover! So stay tuned.



Thanks for reading,

Wayne
@beyondvalence
LinkedIn

Text Analysis Series:
1. Natural Language Processing in Python: Part 1. Texts
2. Natural Language Processing in Python: Part 2. Counting Vocabulary
3. Natural Language Processing in Python: Part 3. Indexing Lists
4. Natural Language Processing in Python: Part 4. Frequency Distributions, Word Selections, & Collocations

Monday, September 15, 2014

Python and Pandas: Part 4. More Baby Names


Hello Readers,

This post continues directly from exploring baby names in Part 3 of the Python and Pandas Series. In 2007, Laura Wattenberg of babynamewizard.com discovered a peculiar trend in baby names, specifically in the last letters of newborns' names. We will use the same U.S. Social Security Administration data from Part 3 to uncover and verify the same trends. Also, we will search for any names which have switched from being given mostly to girls to being given mostly to boys, and the reverse.

Check out Part 3 for the code and information on downloading, loading, and (re)formatting the data. Here in Part 4, we will use both the 'top1000' and original 'names' DataFrames. That is why we pickled the DataFrames previously; I did suggest that it was a good idea to save your progress (hint hint)! Start Python and let's begin.


Isolating the Last Letter


Since we want to include all the names, we will recover and pull the last letters from the main 'names' DataFrame (rhyming points). Import 'pylab', 'pandas', and 'numpy'. Use an anonymous function, a lambda, to create a quick, custom method to pull the last letter from each name, then '.map()' it.

Code:
# extract last letters in names

C:\Users\wayne>cd .\Documents\python\dataAnalysis\git\ch02

C:\Users\wayne\Documents\python\dataAnalysis\git\ch02>ipython --matplotlib
Python 2.7.8 (default, Jun 30 2014, 16:03:49) [MSC v.1500 32 bit (Intel)]
Type "copyright", "credits" or "license" for more information.

IPython 2.1.0 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.
Using matplotlib backend: Qt4Agg

In [1]: import pylab

In [2]: import pandas as pd

In [3]: import numpy as np

In [4]: names = pd.load('names.pkl')

# use lambda function

In [5]: get_last_letter = lambda x: x[-1]

In [7]: last_letters = names.name.map(get_last_letter)

In [9]: last_letters.name = 'last_letter'

# create pivot table last_letters in rows

In [10]: table = names.pivot_table('births', rows=last_letters, \
   ....: cols=['sex', 'year'], aggfunc=sum)
   
# subset only certain years
   
In [13]: subtable = table.reindex(columns=[1910,1960,2010], level='year')

In [14]: subtable.head()
Out[14]:
sex               F                      M
year           1910    1960    2010   1910    1960    2010
last_letter
a            108376  691247  670605    977    5204   28438
b               NaN     694     450    411    3912   38859
c                 5      49     946    482   15476   23125
d              6750    3729    2607  22111  262112   44398
e            133569  435013  313833  28655  178823  129012

With the 'last_letters' in tow, create a pivot table with 'births' as values and 'last_letters' as rows to see a different table than we have seen previously. Instead of an aggregation of names, we have an aggregation of last letters, by sex and year. Subset this table by specifying three years, 1910, 1960, and 2010, for a quick eyeball trend.



Last Letter Proportion


Now that we have the births by last letter, year, and sex in a pivot table, we can go ahead and plot the data. Before we do so, we should normalize the data so we can compare the births between the sex and year groupings. Taking the '.sum()' of the 'subtable' with the three selected years yields the total births in each grouping, by which we divide the births to obtain the proportions.


Code:
# create proportion out of total in each year and sex

In [15]: subtable.sum()
Out[15]:
sex  year
F    1910     396416
     1960    2022062
     2010    1759010
M    1910     194198
     1960    2132588
     2010    1898382
dtype: float64

In [16]: letter_prop = subtable / subtable.sum().astype(float)

In [17]: import matplotlib.pyplot as plt

In [19]: fig, axes = plt.subplots(2, 1, figsize=(10,8))

In [20]: letter_prop['M'].plot(kind='bar', rot=0, ax=axes[0], title='Male')
Out[20]: <matplotlib.axes.AxesSubplot at 0x83ca3d0>

In [21]: letter_prop['F'].plot(kind='bar', rot=0, ax=axes[1], title='Female', legend=False)
Out[21]: <matplotlib.axes.AxesSubplot at 0xf050170>

Only then can we plot the two figures, one for males and one for females, with the letters on the x-axis and the proportions for the three years as different colored bars on the y-axis. So here we explore how all the letters compare over a span of 100 years. Note in inputs 20 and 21 that we add the plots to the two blank subplot spaces separately, which gives us Figure 1.

Figure 1. Male and Female Baby Name Last Letter Proportions

The legend is the same for both male and female plots, so only one is required. From the three selected years 1910, 1960, and 2010, we see different last-letter distributions between male and female names. For females, 'a', 'e', 'y', and 'n' were the most popular throughout the years, and the males matched in 'n'. An argument could be made for 'e' and 'y' for males as well. But we do see an unusual surge in the last letter 'n' for male names in 2010. Male last letters were more evenly distributed across the alphabet, compared to female last letters.



Select Letters Throughout the Years


In the above analysis we examined all the letters and picked a spread of years to analyze the last letters. Here, we flip it around and look at all the years with selected letters. Since we saw some wild fluctuations in proportion over the 50-year intervals, looking closer at each year will shed light on the rise and fall of last letters, such as the male last letter 'n'.

Code:
# normalize by year and sex
# subset last letters of boy names

In [25]: letter_prop = table/table.sum().astype(float)

In [26]: dny_ts = letter_prop.ix[['d','n','y'],'M'].T

In [29]: dny_ts.head()
Out[29]:
             d         n         y
year
1880  0.083055  0.153213  0.075760
1881  0.083247  0.153214  0.077451
1882  0.085340  0.149560  0.077537
1883  0.084066  0.151646  0.079144
1884  0.086120  0.149915  0.080405

In [30]: dny_ts.plot(title='Selected last letters of male baby names')

Out[30]: <matplotlib.axes.AxesSubplot at 0xf3d1690>

# last letter female names

In [34]: lny_ts = letter_prop.ix[['l','n','y'],'F'].T

In [35]: lny_ts.plot(title='Selected last letters of female baby names')
Out[35]: <matplotlib.axes.AxesSubplot at 0xf3aa810>

Switching back to the table with all the years, we apply the proportion conversion to a new table, appropriately named 'letter_prop'. Then select your letters; I selected 'd', 'n', and 'y' for males, making sure to include the suspicious letter 'n'. To get the years into the rows, we transpose the table with '.T', and plot. Simple, right?


As expected, 'd' and 'y' both peaked (around 1938 and 1960, respectively), then fell. The last letter 'n', on the other hand, remained steady and then rose dramatically after 1960, surpassing 36% of all last letters in male baby names.


Figure 2. Male Last Letter Proportion (d,n,y)
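To pin down the peaks seen in Figure 2 rather than eyeball them, a couple of quick lines on the 'dny_ts' table from above will do (a small sketch, assuming the session above is still live):

Code:
# year and value of the maximum proportion for each selected last letter
print(dny_ts.idxmax())    # peak year for 'd', 'n', and 'y'
print(dny_ts.max())       # peak proportion for each letter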

For female names, I selected 'l', 'n', and 'y'; you can select whichever letters you prefer, and more than three if you like. We can see a clearer picture than in Figure 1, where we only had three years to see the proportions of all the letters. It appears that 'y' as a last letter in a female name, along with 'n', both peaked twice. Compare those two to 'l', which has a relatively stable representation throughout the years, with modest peaks around 1897, 1945, and 1983.



Figure 3. Female Last Letter Proportion (l,n,y)


Name Flipping


Names such as Jamie, Leslie, or Madison come to mind when thinking of names given to both boys and girls. Has a name ever switched from predominantly belonging to one sex to the other? Let's look at the name Leslie and variants of its spelling. Here we will use the 'top1000' data set.

Obtain the set of the names by using '.unique()'. To find all the variants of Leslie, iterate through 'all_names' checking for 'lesl', and use the resulting mask to index 'all_names' and see what it picked up. Leslie, Lesley, Leslee, Lesli, and Lesly are all variants of the name Leslie present in the data. Now we take this array and filter the name variable of 'top1000', keeping any matching names with '.isin()'.

In DataFrame 'filtered', we have all the years with only the qualifying Leslie name variants in the name variable. Grouping by name, make a quick table with '.sum()' to see Leslie (370,429) as the most common spelling, followed by Lesley (35,022) and Lesly (10,067).

Code:
# boy names that became girl names and reverse

In [36]: all_names = top1000.name.unique()

In [37]: mask = np.array(['lesl' in x.lower() for x in all_names])

In [39]: lesley_like = all_names[mask]

In [40]: lesley_like
Out[40]: array(['Leslie', 'Lesley', 'Leslee', 'Lesli', 'Lesly'], dtype=object)

In [41]: filtered = top1000[top1000.name.isin(lesley_like)]

In [42]: filtered.groupby('name').births.sum()
Out[42]:
name
Leslee      1082
Lesley     35022
Lesli        929
Leslie    370429
Lesly      10067
Name: births, dtype: int64

In [43]: table = filtered.pivot_table('births', rows='year', \
   ....: cols='sex', aggfunc=sum)

In [44]: table = table.div(table.sum(1), axis=0)

In [45]: table.tail()
Out[45]:
sex   F   M
year
2006  1 NaN
2007  1 NaN
2008  1 NaN
2009  1 NaN
2010  1 NaN

In [46]: table.plot(style={'M': 'k-', 'F': 'k--'})
Out[46]: <matplotlib.axes.AxesSubplot at 0xf3e9250>

To format the data for a plot, create a pivot table of the summed births with the years as rows and sex as columns. Again, to normalize the values, we divide each row by its sum, thereby finding the proportion of male and female births for each year across the Leslie name variants. Using '.tail()' to print the last five rows, we see that from 2006 to 2010, Leslie & Friends were all assigned to females. Has it changed historically? Specify the male and female line styles, and plot it to find out!


We discover that around 1955, parents began to give the name Leslie and its variants more often to female babies. What started out as a name more common among males from the 1880s to the 1940s began to reverse. Today, more female babies than male babies are named Leslie, Lesley, Lesly, etc.


Figure 4. Leslie and Variants in Female and Male Baby Names
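To locate the crossover year in Figure 4 precisely, we can query the normalized 'table' from above for the first year in which the female share exceeds one half (a quick sketch, assuming 'table' is still in scope):

Code:
# first year where Leslie-like names are majority female
crossover = table[table['F'] > 0.5].index[0]
print(crossover)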


Hope you enjoyed this Python post about baby names! Here we examined some in-depth metrics, such as the last letter in each name. It takes some creativity to wrangle data in order to discover trending insights. What else can you measure from the names and birth data? Feel free to comment below.



Thanks for reading,

Wayne
@beyondvalence
LinkedIn

Python and Pandas Series:
1. Python and Pandas: Part 1: bit.ly and Time Zones
2. Python and Pandas: Part 2. Movie Ratings
3. Python and Pandas: Part 3. Baby Names, 1880-2010
4. Python and Pandas: Part 4. More Baby Names

Saturday, September 13, 2014

Python and Pandas: Part 3. Baby Names, 1880-2010


Hello Readers,

Here in the third part of the Python and Pandas series, we analyze over 1.6 million baby name records from the United States Social Security Administration from 1880 to 2010. A particular name must have at least 5 occurrences for inclusion into the data set. We will explore reading in multiple raw data files, merging them into one DataFrame, subsetting desired portions of the data, creating new variable metrics, and visualizing results.

As usual, start IPython in your command prompt if you want to follow along. You can find the data here, under National data (it unzips to 'names' folder). Let's jump in.


Data Preview


The baby name files are split by year of birth, all in a similar format: 'yob1880.txt', 'yob1881.txt', and so on to 'yob2010.txt'. You can go ahead and import the 'pandas', 'pylab', and 'numpy' modules now, or when they are required later.

Use the '.read_csv()' method to access the first text file of baby names in 1880. We see that there were 2,000 boy and girl names from the data that year (n>=5), with 3 variables: the name, the sex of the baby, and the birth count for that name.

Code:
C:\Users\wayne>cd .\Documents\python\dataAnalysis\git\ch02

C:\Users\wayne\Documents\python\dataAnalysis\git\ch02>ipython --matplotlib
Python 2.7.8 (default, Jun 30 2014, 16:03:49) [MSC v.1500 32 bit (Intel)]
Type "copyright", "credits" or "license" for more information.

IPython 2.1.0 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.
Using matplotlib backend: Qt4Agg

In [1]: import pandas as pd

In [2]: names1880 = pd.read_csv('names\yob1880.txt',names=['name','sex','births'])

In [3]: names1880
Out[3]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2000 entries, 0 to 1999
Data columns (total 3 columns):
name      2000  non-null values
sex       2000  non-null values
births    2000  non-null values
dtypes: int64(1), object(2)

In [4]: names1880.groupby('sex').births.sum()
Out[4]:
sex
F       90993
M      110493
Name: births, dtype: int64

Performing a quick tabulation, we group the data by 'sex' and view the sum of 'births'. There are 90,993 girl births and 110,493 boy births in the 1880 data.



Assembling the Data


Now that we have an idea of the data contents, and we know the pattern of the text file names, we can create a loop to read in the data. At the same time, we add another variable denoting the year of each name entry, so the years can be told apart once they are all together.

Create a 'years' variable which we will use to iterate through each year's text file. Then we read the data, add a 'year' column, append the data to the list 'pieces', and merge them together. Using the '%d' string formatter, we can replace that placeholder with a given variable, 'year'. After using the '.append()' method to add the current 'frame' object to 'pieces', we take advantage of '.concat()' to merge the frames in 'pieces' by row into a completed DataFrame, 'names'. 'ignore_index' should be True because we do not want to keep the original indexes.


Code:
# data split by year
# so assemble data into single DataFrame and add year field
# use %d string formatter to iterate through the years
# add 'year' column
# append pieces together
# .concat merges by row and do not preserve original row numbers

In [9]: pieces = []

In [10]: columns = ['name', 'sex', 'births']

In [11]: years = range(1880, 2011)

In [12]: for year in years:
   ....:     path = 'names/yob%d.txt' % year
   ....:     frame = pd.read_csv(path, names=columns)
   ....:     frame['year'] = year
   ....:     pieces.append(frame)
   ....:     names = pd.concat(pieces, ignore_index=True)
   ....:

In [13]: names
Out[13]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1690784 entries, 0 to 1690783
Data columns (total 4 columns):
name      1690784  non-null values
sex       1690784  non-null values
births    1690784  non-null values
year      1690784  non-null values
dtypes: int64(2), object(2)

# 1,690,783 rows of data with 4 columns

In [14]: names.save('names.pkl')

In the 'names' DataFrame, we have 1,690,784 rows from years 1880 to 2010 with 4 columns, including the year. Remember to pickle the DataFrame with '.save()', in other words, save it. It is a hefty file, around 63 MB in size, but Python will do all the heavy lifting!
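One caveat if you are following along with a newer pandas: several of the method names and keyword arguments used in this post have since changed. The lines below are a rough translation for current pandas, not part of the original session.

Code:
# rough modern-pandas equivalents of calls used in this post
names.to_pickle('names.pkl')             # replaces names.save('names.pkl')
names = pd.read_pickle('names.pkl')      # replaces pd.load('names.pkl')

# pivot_table now takes index=/columns= instead of rows=/cols=
total_births = names.pivot_table('births', index='year',
                                 columns='sex', aggfunc='sum')

# sort_index(by=...) has become sort_values(by=...)
names_sorted = names.sort_values(by='births', ascending=False)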



Exploring the Data


First off, a pivot table is in order. Let's put 'sex' in the columns and 'year' in the rows, with 'births' as the values. Calling '.tail()' will give us the last 5 rows in the table. To get a bigger picture, plot the table of births stratified by sex and year.

Code:
In [17]: total_births = names.pivot_table('births', rows='year', cols='sex', aggfunc=sum)

In [19]: total_births.tail()
Out[19]:
sex         F        M
year
2006  1896468  2050234
2007  1916888  2069242
2008  1883645  2032310
2009  1827643  1973359
2010  1759010  1898382

In [20]: total_births.plot(title='Total Births by sex and year')
Out[20]: <matplotlib.axes.AxesSubplot at 0x1a5f4730>

Figure 1. Total births by sex and year
We can observe the birth counts rise and fall with economic trends; births tend to fall in times of recession, and male births started to outpace female births after WWII.


Adding Proportion and Subsetting Top 1000 Names


Here we add a column of proportions to our 'names' DataFrame. The proportion will be each name's births out of the total births, grouped by year and sex. We define a new method, 'add_prop()', and convert the births to a float type to avoid integer division. Then we divide the births by the sum of births in the grouping, and return the group.

Code:

# group by year and sex
# add proportion of babies given certain name relative to number of births

In [23]: def add_prop(group):
   ....:     # integer division floors
   ....:     births = group.births.astype(float)
   ....:     group['prop'] = births / births.sum()
   ....:     return group
   ....:

In [24]: names = names.groupby(['year','sex']).apply(add_prop)

In [25]: names
Out[25]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1690784 entries, 0 to 1690783
Data columns (total 5 columns):
name      1690784  non-null values
sex       1690784  non-null values
births    1690784  non-null values
year      1690784  non-null values
prop      1690784  non-null values
dtypes: float64(1), int64(2), object(2)

# check to see values by group sum to 1

In [28]: import numpy as np

In [29]: np.allclose(names.groupby(['year','sex']).prop.sum(),1)
Out[29]: True

# subset top 1000 births

In [30]: def get_top1000(group):
   ....:     return group.sort_index(by='births', ascending=False)[:1000]
   ....:

In [31]: grouped = names.groupby(['year','sex'])

In [32]: top1000 = grouped.apply(get_top1000)

In [37]: top1000
Out[37]:
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 261877 entries, (1880, F, 0) to (2010, M, 1677643)
Data columns (total 5 columns):
name      261877  non-null values
sex       261877  non-null values
births    261877  non-null values
year      261877  non-null values
prop      261877  non-null values
dtypes: float64(1), int64(2), object(2)

We pass the groups by year and sex to the 'add_prop()' method using '.apply()'. Confirming the new 'prop' column, the 'names' DataFrame now has 5 columns. To ensure the birth proportions by group are accurate, we verify with the '.allclose()' method from the 'numpy' module, comparing the group sums to 1. Python returns 'True', and we are assured the column values are correct.


With this new DataFrame, we will now subset the top 1000 names by births in each year and sex grouping. Define a new method, 'get_top1000()', which sorts the births in descending order and returns the first 1000 entries. We pass the 'names' DataFrame grouped by year and sex to the 'get_top1000()' method, producing our new DataFrame, 'top1000'. Instead of over 1.6 million entries, we now have 261,877 entries with which to work.


Some Naming Trends


Because the data span from 1880 to 2010, we can examine baby name trends over time for any changes. As usual, remember to save and pickle the 'top1000' DataFrame as we go along with the analysis. Begin by separating the sexes into two different DataFrames for later use.

Now create a pivot table from 'top1000', with births as the summed values, years in the rows, and names in the columns. There are 131 rows, one for each year, and 6,865 columns, one for each name. We will subset by column, take only specific names, and plot the births for the selected names by year in a single figure. You can choose different names; I chose John, Harry, Mary, and Marilyn as sample names.

Code:
# analyze naming trends

In [38]: top1000.save('top1000.pkl')

In [39]: boys = top1000[top1000.sex == 'M']

In [40]: girls = top1000[top1000.sex == 'F']

In [41]: total_births = top1000.pivot_table('births', rows='year', cols='name', aggfunc=sum)

In [42]: total_births
Out[42]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 131 entries, 1880 to 2010
Columns: 6865 entries, Aaden to Zuri
dtypes: float64(6865)

In [43]: subset = total_births[['John','Harry','Mary','Marilyn']]

In [46]: subset.plot(subplots=True, figsize=(12,10), grid=False, title='Number of births per year')
Out[46]:
array([<matplotlib.axes.AxesSubplot object at 0x1A7144D0>,
       <matplotlib.axes.AxesSubplot object at 0x14C13E30>,
       <matplotlib.axes.AxesSubplot object at 0x1AA527F0>,
       <matplotlib.axes.AxesSubplot object at 0x172A05B0>], dtype=object)

# plot 'Wayne'
    
In [47]: subsetw = total_births[['Wayne']]

In [48]: subsetw.plot(title='Wayne')
Out[48]: <matplotlib.axes.AxesSubplot at 0x14c0f2f0>

# names growing out of favor?

Figure 2. Number of Births per Year, Selected Names
John, Harry, and Mary have bimodal peaks around the 1920s and 1950s. Marilyn became steadily more popular from the 1930s to the late 1950s. For all four names, we observe a subsequent fall in births per year. Are those names really becoming less common? We will discover what is happening in the data below.

Curious, I plotted my own name to see the birth trend for 'Wayne'. It follows the same rise, peak, and fall around the 1950s, though with less of a bimodal distribution.


Figure 3. Number of Births per Year, For Name: Wayne

Baby Name Diversity


First, a spoiler: the drop in births for certain names has something to do with name diversity, that is, what parents choose to name their children. The trend changes from the 1950s onwards. To examine this change, we turn to the variable we created earlier, the proportion of births in each group by year and sex. So we create a pivot table from 'top1000', but this time with 'prop' as the summed values, 'year' as rows, and 'sex' as columns.

This will allow us to plot Figure 4. Note how the total proportion starts near 1.0 in 1880, and begins to drop around 1960 for females and around 1970 for males. By 2010, the proportion of births accounted for by the top 1000 names declines to around 74% for females and 85% for males. That means the share of births going to names outside the top 1000 has risen; more parents are choosing different, less common names for their newborns.


Figure 4. Proportion of Top Births by Year and Sex
We can check this by examining the boys and girls DataFrames we created earlier. We subset the year 2010, sort by proportion in descending order, then take the cumulative sum of the proportions of births. Taking the first 10 names, we see that the top name accounted for roughly 1.15% of total male births in 2010. Using '.searchsorted(0.5)' to find the sorted index of the 50th percentile, Python returns 116. Therefore 117 names make up 50% of the male births in 2010. We compare this number to the corresponding figure for male births in 1900, which is 25. From 1900 to 2010, the number of names covering the top 50% of male births increased more than 350%, from 25 to 117. Male name diversity sure increased over the years.

Code:
# exploring increases in naming diversity
# fewer parents choosing common names for children

In [15]: table = top1000.pivot_table('prop', rows=
   ....: 'year',cols='sex', aggfunc=sum)

In [17]: table.plot(title='Sum of table1000.prop by year and sex', \
   ....: yticks=np.linspace(0,1.2,13), xticks=range(1880,2020,10))
Out[17]: <matplotlib.axes.AxesSubplot at 0x6a1ead0>

# names proportion going down from 1 from top 1000 names

In [20]: df = boys[boys.year==2010]

In [21]: df
Out[21]:
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 1000 entries, (2010, M, 1676644) to (2010, M, 1677643)
Data columns (total 5 columns):
name      1000  non-null values
sex       1000  non-null values
births    1000  non-null values
year      1000  non-null values
prop      1000  non-null values
dtypes: float64(1), int64(2), object(2)

In [22]: prop_cumsum = df.sort_index(by='prop', ascending=False).prop.cumsum()

In [23]: prop_cumsum[:10]
Out[23]:
year  sex
2010  M    1676644    0.011523
           1676645    0.020934
           1676646    0.029959
           1676647    0.038930
           1676648    0.047817
           1676649    0.056579
           1676650    0.065155
           1676651    0.073414
           1676652    0.081528
           1676653    0.089621
dtype: float64

In [24]: prop_cumsum.searchsorted(0.5)
Out[24]: 116

# index 116, so 117 names in top 50% in 2010

In [25]: df1900 = boys[boys.year==1900]

In [26]: prop1900 = df1900.sort_index(by='prop', ascending=False).prop.cumsum()

In [27]: prop1900.searchsorted(0.5)+1
Out[27]: 25

# in 1900, top 50% of names covered with 25 names
# so there is a large increase in name diversity

In [28]: def get_quantile(group, q=0.5):
   ....:     group = group.sort_index(by='prop', ascending=False)
   ....:     return group.prop.cumsum().searchsorted(q)+1
   ....:

In [29]: diversity = top1000.groupby(['year','sex']).apply(get_quantile)

In [30]: diversity = diversity.unstack('sex')

In [31]: diversity.head()
Out[31]:
sex    F   M
year
1880  38  14
1881  38  14
1882  38  15
1883  39  15
1884  39  16

In [32]: diversity.plot(title='Number of popular names in top 50%')
Out[32]: <matplotlib.axes.AxesSubplot at 0x2075c1f0>

This diversity increase holds for female births as well. Rather than take the number of names in the top 50% of births for just 1900 and 2010, we calculate it for all the years, and for both male and female names. Define a new function, 'get_quantile(group, q=0.5)', which sorts the 'group' argument by the 'prop' variable in descending order, and returns the index where the cumulative sum reaches 0.5, adding 1 at the end to convert the zero-based index into a count.


Pass this method over the top 1000 names grouped by year and sex to produce the 'diversity' DataFrame. Reconfigure the table by placing 'sex' in the columns with '.unstack()' to finalize it. Take a peek at the first 5 years of the 'diversity' data with '.head()', and observe that 38 female names and 14 male names accounted for 50% of the top births in 1880. Lastly we plot the 'diversity' DataFrame, shown below.


Figure 5. Number of Popular Baby Names in the Top 50% of Births

We see a distinct increase in name diversity around 1985 for both males and females. Historically, female names were more diverse than male names. By 2010, the number of female names accounting for the top 50% of births was more than double the male counterpart.

Again, we reach the end of another lengthy, but I hope enjoyable, post on Python and Pandas concerning baby names. We explored and manipulated a dataset of 1.6 million rows, re-organized DataFrames, created new variables, and visualized various name metrics, all after assembling data split across 131 text files. There is more on baby names to explore in Part 4 of this series. So stay tuned for more!


Thanks for reading,

Wayne
@beyondvalence
LinkedIn

Python and Pandas Series:
1. Python and Pandas: Part 1: bit.ly and Time Zones
2. Python and Pandas: Part 2. Movie Ratings
3. Python and Pandas: Part 3. Baby Names, 1880-2010
4. Python and Pandas: Part 4. More Baby Names