Pages

Monday, September 29, 2014

Handling Large Data with SQL Server


Hello Readers,

Big Data is all the rage these days. On this blog we have mainly used R as our analysis program of choice (sorry Python) to examine, model, and predict on the data. R is optimal for data with hundred thousands of rows or less, and dealing with larger data sets with millions of rows or more usually slows R down to a crawl. (Remember the neural network post?) Another deciding factor on computation speed is your computer setup. 

Because there are millions of rows in big data sets, we need only sample a few hundred thousand of them into R. But where will we store the millions of rows prior to sampling? Easy, into a data warehouse made for million plus row data sets, a SQL Server database. 

We have not dealt with hundreds of millions or billions of rows, so resist the urge to mention the Hadoop distributed file system. Or should I say not yet? Anyways, here we will load million plus row U.S. Census data sets in preparation for R sampling.


Getting Census Data


Specifically we concentrate on two data sets, U.S. Census American Community Survey 2011, where we can find nationwide population and housing data. You can download the data via ftp here, or choose 2011 ACS 1-year PUMS from the ACS site. The ACS encompasses social, economic, demographic, and housing data. A screenshot of the page is shown below:


Figure 1. American Community Survey (Census) Webpage
After clicking the red highlighted link, 2011 ACS 1-year PUMS, we choose from CSV or SAS file format. Take your pick, I chose CSV for importing into SQL Server.


Figure 2. Choose File Format - CSV
Next, the site gives us choices whether to download the entire dataset for the U.S. or individual states. Of course, download both the Population and Housing data for the entire U.S. Be warned, these two files take up nearly 1 GB of hard drive space, so be patient! Download accelerator anyone?


Figure 3. Download Population and Housing Data Files
After we have the zip files downloaded, unzip them. Note that the housing data 'ss11hus*' and population data 'ss11pus*' have parts a and b.


Figure 4. Unzipped Files
You can also find the helpful README file for the data at the ACS site as well.


Importing into SQL Server


Now for the crucial part, where we import the data into a database system, SQL Server, instead of reading the CSV files into R directly. After connecting to your server, select which database where you would import the CSV files. Right-click and under Tasks, > choose Import Files.

Figure 5. Import Files Option
Next we specify the data source. The CSV files are not Excel or Access files, rather they are a Flat File Source. So select the appropriate option, and select a CSV file from the unzipped file lcoation. On the left you can see a Columns view option under General. You can preview the file contents prior to importing it.

Figure 6. Select Flat File Source
Figure 7 shows the Columns for our preview, so we can ensure the proper headers and values are read by the SQL Server Wizard.

Figure 7. Column Previews
Even though we right-clicked a certain database, SQL Server still asks us in which database we want to import the data. You can change your mind, if you clicked wrong earlier, so take your pick.

Figure 8. Table Destination
Here we encounter the Source Table Option window. We can alter the schema and name of table with Edit Mapping at the bottom. The current schema and name are set as 'dbo' and 'ss11pusa', respectively. If you want to use a different schema than the default, now is the time to impose the change.

Figure 9. Select Source Tables
SQL Server gives us a last overview option before we start the importing process. Ready? Click Finish.

Figure 10. Last Check
And here we go! Look at the Import Wizard chug along as it imports over a million rows!

Figure 11. Import Progressing
When the process is finished, we are rewarded with... a new table lots of rows. Take a look below. 1.6 million? Not bad. Now you just have to import the remaining 3 for a complete set!

Figure 12. Execution Successful
Remember to refresh the Server instance for it to reflect the new table changes. Now you can query away in a new query window to explore the data, or you can keep the Server on and jump into R to establish a database connection using ODBC (open database connectivity).

Figure 13. Locating Imported Table
I thought this post made up for the previous few posts' lack of pictures. Was the compensation count this time sufficient (13)? Here we learned how import a large data set, separated into 4 tables, into a database management system, SQL Server. The next post will feature the next step in accessing Big Data in R, the database connection. There we shall use SQL Server as a data warehouse and R as the analytics tool. Stay tuned.

Thanks for reading,

Wayne
@beyondvalence
LinkedIn

Wednesday, September 24, 2014

Natural Language Processing in Python: Part 4. Frequency Distributions, Word Selections, & Collocations


Hello Readers,

We settle into Part 4 of the Text Analysis Series with Python by examining frequency distributions, word selections, and collocations. As a refresher, collocations are a sequence of words that occur together unusually often, given individual word frequencies. 

A specific case would be a pair of words forming a collocation in a bigram, such as 'red wine'. Bravo or brava, a trigram sequence would be three words long. 'red' and 'wine' would occur together quite often, as opposed to generic 'the wine', and 'maroon wine' would make little sense. That demonstrates how collocations are resistant to substitutions- because only those certain words carry that meaning, so those specific words are used. And only those words, so they occur quite frequently together.

Before we jump into collocations, let's start with frequency distributions. In the 'nltk' module in Python, we have a number of text corpora available for analysis. Load it, and get yourself comfortable for the ride.


Frequency Distributions


So we did some counting in a previous NLP post. We will count in this post as well, but in a different process. We aim to quantify each unique token in a given text corpus. How many times does 'government' occur in the Inaugural Address corpus? What about 'lol' in the Internet Chat corpus? Use the 'FreqDist()' method! As we see below, 'FreqDist()' takes the text and creates a frequency distribution for the unique tokens, and they aren't all words. They can be periods, parenthesis, commas, etc. 


With the summary of 'fdist1', we observe that it has 19,317 samples, or tokens, and 260,819 total counts, or length of the text.

Code:
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
Python 2.7.8 (default, Jun 30 2014, 16:03:49) [MSC v.1500 32 bit (Intel)]
Type "copyright", "credits" or "license" for more information.

IPython 2.1.0 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.
%guiref   -> A brief reference about the graphical user interface.

In [1]: from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908

#####
# frequency distributions

# create FreqDist object
In [3]: fdist1 = FreqDist(text1)

# summary of FreqDist object, 19317 unique tokens with 260,819 total tokens
In [4]: fdist1
Out[4]: <FreqDist with 19317 samples and 260819 outcomes>

# retrieve set of tokens
In [5]: vocab1 = fdist1.keys()

# display first 10 of set
In [6]: vocab1[:10]
Out[6]:
[u'funereal',
 u'unscientific',
 u'divinely',
 u'foul',
 u'four',
 u'gag',
 u'prefix',
 u'woods',
 u'clotted',
 u'Duck']

# display number of occurrences for 'whale' token
In [7]: fdist1['whale']
Out[7]: 906

# plot first 20 terms
In [8]: fdist1.plot(20, cumulative=False)

Accessing the '.keys()' method, we can assign the unique token to 'vocab1' and look into the first 10 elements. We see words such as 'funereal', 'unscientific', and 'divinely'. Lastly we can look up the count of a word. Take 'whale', since text1 is Moby Dick, and we see that text1 has 906 occurrences of 'whale'.


Lastly, the 'FreqDist' object has a plot function. We specify the number of terms, and whether the plot is cumulative or not, and Python returns Figure 1.


Figure 1. Frequency Distribution of 20 Terms in Text1

Notice how 'four' has the highest count in this sample of 20 terms, with 'hanging' coming in second. The words you see with barely any count have a count of 1- they only occur in the text corpus once! These special terms are named hapaxes (hapax singular). nltk has a special function just for identifying hapaxes. You guessed it, '.hapaxes()'!


Code:
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
######
# hapaxes are words that occur only once

# display first 10 occurrences
In [8]: fdist1.hapaxes()[:10]
Out[8]:
[u'funereal',
 u'unscientific',
 u'prefix',
 u'plaudits',
 u'woody',
 u'disobeying',
 u'Westers',
 u'DRYDEN',
 u'Untried',
 u'superficially']
 
# display total number of single occurrences 
In [9]: len(fdist1.hapaxes())
Out[9]: 9002

The first 10 hapaxes can be sliced from the function, and we can also see how many unique terms are in text1 by passing the result to the 'len()' method. We see that text1 has 9002 terms which occur only once.



Word Selections


Now we will take advantage of Python looping through iterable objects to select words with certain attributes of word length and occurrences in a corpus. This way we can search for defining words which capture the essence of a corpus, or track trends in word usage. We can throttle the word count or word length to our needs in different circumstances.


Word Length
We can create an if condition to accept only words with more than 15 characters in the set of unique tokens from text1, as shown below. Some long words which occur are: 'CIRCUMNAVIGATION', 'Physiognomically', and 'apprehensiveness'.


Code:
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
# using set theory to select words
# lengthy words of 15 characters or more
# {w | w E V & P(w)}
# [w for w in V if P(w)]
# the set of all w such that w is an element of V (vocab) and has property P

# get set of vocab in text1
In [10]: V = set(text1)

# iterate through V, grabbing each word with character length greater than 15
In [11]: long_words = [w for w in V if len(w) > 15]

# display sorted first 10 lengthy words
In [12]: sorted(long_words)[:10]
Out[12]:
[u'CIRCUMNAVIGATION',
 u'Physiognomically',
 u'apprehensiveness',
 u'cannibalistically',
 u'characteristically',
 u'circumnavigating',
 u'circumnavigation',
 u'circumnavigations',
 u'comprehensiveness',
 u'hermaphroditical']

#####
# looking at internet long word patterns
# more than 15 characters

# check which text number is internet chat, #5 
In [13]: texts()
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908

# create unique vocab set
In [14]: vocab = set(text5)

# iterate through vocab for words greater than 15 characters in length
In [15]: long_chat_words = [word for word in vocab if len(word) > 15]

# display first 10 sorted
In [16]: sorted(long_chat_words)[:10]
Out[16]:
[u'!!!!!!!!!!!!!!!!',
 u'!!!!!!!!!!!!!!!!!!!!!!',
 u'!!!!!!!!!!!!!!!!!!!!!!!',
 u'!!!!!!!!!!!!!!!!!!!!!!!!!!!',
 u'!!!!!!!!!!!!!!!!!!!!!!!!!!!!',
 u'!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!',
 u'#talkcity_adults',
 u'(((((((((((((((((',
 u'((((((((((((((((((',
 u'((((((((((((((((((((']

# display 101st to 110th sorted, no results 
In [17]: sorted(long_chat_words)[100:111]
Out[17]: []

# index from last for last 10
# observe exaggerated chat patterns
In [18]: sorted(long_chat_words)[-10:]
Out[18]:
[u'oooooooooooooonnnnnnnnnnnneeeeeeeeeeeeeeesssssssss',
 u'raaaaaaaaaaaaaaaaaaaaaaaaaaaaa',
 u'tatatatnanaantatat',
 u'weeeeeeeeeeeeeeee',
 u'weeeeeeeeeeeeeeeeeeeeeeeeed',
 u'wheeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee',
 u'woooooooooaaaahhhhhhhhhhhh',
 u'wooooooooooooohoooooooooooooooo',
 u'www.Wunderground.com',
 u'yuuuuuuuuuuuummmmmmmmmmmm']

Furthermore, we use the Internet Chat corpus, text5, to examine some words with long length. The first few are simply exclamation points, while the last few are 'overspelled' for dramatic effect.


Word Length and Frequency
With word length, we could consider another attribute to select words from a corpus. We could use the word frequency. Even if a word is long winded, if it occurs more than a few times it could be indicative of an important word in the corpus. So we include the count from the frequency distribution. 

Looking at the Internet Chat corpus (text5) again, we select for the word length to be more than 8, and a frequency higher than 5. Keep in mind we have to create the frequency distribution from text5 first, and use the set of text5 as the iterable variable.

Code:
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
#####
# which words typify a text? 
# the long words or the single occurrences? (hapaxes)
# what about frequently occurring long words?

# display text names
In [19]: texts()
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908

# create FreqDist object for text5
In [20]: fdist5 = FreqDist(text5)

# sort words iterated through set of text5
# having character length more than 8, and occurring more than 5 times
In [25]: selected_words = sorted([w for w in set(text5) if len(w) > 8 and fdist5[w] > 5])

# display words selected on minimum word length and occurrence
In [26]: selected_words
Out[26]:
[u'#14-19teens',
 u'#talkcity_adults',
 u'((((((((((',
 u')))))))))))',
 u')))))))))))))',
 u'.........',
 u'Compliments',
 u'cute.-ass',
 u'everybody',
 u'everything',
 u'listening',
 u'seriously',
 u'something',
 u'sometimes']

The results vary in content, as Internet Chat is not censored, as you might discover if you delve deeper into the text.



Bigrams & Collocations


Here we arrive at the word pairs and special word pairs. The 'bigrams()' method creates pairings of words as it iterates through the text, combining adjacent words. Collocations pull those word pairs which exist together unusually frequently, and you might find that they have a particular meaning when seen together and are not descriptive when apart.

The bigram for 'more is said than done' is shown below. Note that the adjacent words are paired. For collocations, use the '.collocations()' method on a text corpus to retrieve the list of collocation terms. Looking at text4 Inaugural Address corpus, 'United States', 'fellow citizens', 'four years', and such are mentioned in those combinations more than not. 'Indian tribes', and 'Chief Justice', are terms which occur together.

Code:
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
# collocation : sequence of words that occur together unusually often
# ex: 'red wine', as opposed to 'the wine'
# bigram : word pairs 

# create bigram list
In [34]: bigram1 = list(bigrams(['more','is','said','than','done']))

# display bigram, note pairs of words
In [35]: bigram1
Out[35]: [('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]

# collocation for text4
# bigrams with words that occur together more frequently 
# than expected based on frequency of individual words
In [37]: text4.collocations()
Building collocations list
United States; fellow citizens; four years; years ago; Federal
Government; General Government; American people; Vice President; Old
World; Almighty God; Fellow citizens; Chief Magistrate; Chief Justice;
God bless; every citizen; Indian tribes; public debt; one another;
foreign nations; political parties

# collocations for personals corpus
In [38]: texts()
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908

# display personals collocations
In [39]: text8.collocations()
Building collocations list
would like; medium build; social drinker; quiet nights; non smoker;
long term; age open; Would like; easy going; financially secure; fun
times; similar interests; Age open; weekends away; poss rship; well
presented; never married; single mum; permanent relationship; slim
build

For the personals corpus, text8, we encounter word pairs in personal advertisements online. Likely and logical word combinations such as 'medium build', 'social drinker', 'quiet nights', easy going', 'financially secure', and 'permanent relationship' are paired together due to the nature of personality and lifestyle description. Simply put, those words go together, like peanut butter and jelly, although I am a 'peanut butter will do' person. The pair 'peanut butter' would be a collocation in a baking recipe or snacks corpus.


Wow, there goes another post! I know there was only one graph this time, so thanks for making it this far. Here we learned about frequency distributions, different ways to select words from a text corpus, and word selection tools to create bigrams and collocations in Python. This is just the beginning, and there is lots more text analysis to cover! So stay tuned.



Thanks for reading,

Wayne
@beyondvalence
LinkedIn

Text Analysis Series:
1. Natural Language Processing in Python: Part 1. Texts
2. Natural Language Processing in Python: Part 2. Counting Vocabulary
3. Natural Language Processing in Python: Part 3. Indexing Lists
4. Natural Language Processing in Python: Part 4. Frequency Distributions, Word Selections, & Collocations
.

Monday, September 15, 2014

Python and Pandas: Part 4. More Baby Names


Hello Readers,

This post continues directly from exploring baby names in Part 3 of the Python and Pandas Series. In 2007, Laura Wattenburg of babynamewizard.com discovered a peculiar trend in baby names, specifically the last letters in the names of newborns. We will extend the same U.S. Social Security Administration data we used previously in Part 3 to uncover, and verify the same trends. Also, we will search for any names which have switched from being assigned to girls to being assigned to guys, and reverse.

Check out Part 3 for the code and information on downloading, loading, and (re)formatting the data. Here in Part 4, we will use both the 'top1000' and original 'names' DataFrames. That is why we pickled the DataFrames previously- I did suggest that it was a good idea to save your progress (hint hint)! Start Python and let's begin.


Isolating the Last Letter


Since we want to include all the names, we will recover and pull the last letters from the main 'name' DataFrame (rhyming points). Import 'pylab', 'pandas', and 'numpy'. Use an anonymous function, lambda, to create a quick, custom method to pull the last letter from each name, then '.map()' it.

Code:
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
# extract last letters in names

C:\Users\wayne>cd .\Documents\python\dataAnalysis\git\ch02

C:\Users\wayne\Documents\python\dataAnalysis\git\ch02>ipython --matplotlib
Python 2.7.8 (default, Jun 30 2014, 16:03:49) [MSC v.1500 32 bit (Intel)]
Type "copyright", "credits" or "license" for more information.

IPython 2.1.0 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.
Using matplotlib backend: Qt4Agg

In [1]: import pylab

In [2]: import pandas as pd

In [3]: import numpy as np

In [4]: names = pd.load('names.pkl')

# use lambda function

In [5]: get_last_letter = lambda x: x[-1]

In [7]: last_letters = names.name.map(get_last_letter)

In [9]: last_letters.name = 'last_letter'

# create pivot table last_letters in rows

In [10]: table = names.pivot_table('births', rows=last_letters, \
   ....: cols=['sex', 'year'], aggfunc=sum)
   
# subset only certain years
   
In [13]: subtable = table.reindex(columns=[1910,1960,2010], level='year')

In [14]: subtable.head()
Out[14]:
sex               F                      M
year           1910    1960    2010   1910    1960    2010
last_letter
a            108376  691247  670605    977    5204   28438
b               NaN     694     450    411    3912   38859
c                 5      49     946    482   15476   23125
d              6750    3729    2607  22111  262112   44398
e            133569  435013  313833  28655  178823  129012

With the 'last_letters' in tow, create a pivot table with 'births' as values, and rows as 'last_letters' to see a different table than we have seen previously. Instead of an aggregation of names, we have an aggregation of last letters, by sex and year. Subset this table by specifying 3 dates: 1910, 1960, and 2010, for a quick eyeball trend.



Last Letter Proportion


Now that we have the births by last letters, years, and sex in a pivot table, we can go ahead and plot the data. Before we do so, we should normalize the data so we can compare the births between the sex and year groupings. Taking the '.sum()' of the 'subtable' with 3 interval years yields the sum of the births in the groupings, which we will use to divide the births to obtain the proportion.


Code:
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
# create proportion out of total in each year and sex

In [15]: subtable.sum()
Out[15]:
sex  year
F    1910     396416
     1960    2022062
     2010    1759010
M    1910     194198
     1960    2132588
     2010    1898382
dtype: float64

In [16]: letter_prop = subtable / subtable.sum().astype(float)

In [17]: import matplotlib.pyplot as plt

In [19]: fig, axes = plt.subplots(2, 1, figsize=(10,8))

In [20]: letter_prop['M'].plot(kind='bar', rot=0, ax=axes[0], title='Male')
Out[20]: <matplotlib.axes.AxesSubplot at 0x83ca3d0>

In [21]: letter_prop['F'].plot(kind='bar', rot=0, ax=axes[1], title='Female', legend=False)
Out[21]: <matplotlib.axes.AxesSubplot at 0xf050170>

Only then can we plot the two figures, separated by male and female plots, with the x-axis for letters, and y-axis for proportion of the 3 years in different colored bars. So here we will explore how all the letters compare over a time period of 60 years. Note in inputs 20 and 21, that we add the plots to the two blank plot spaces separately, which gives us Figure 1.

Figure 1. Male and Female Baby Name Last Letter Proportions

The legend is the same for both male and female plots, so only one is required. From the three selected years 1910, 1960, and 2010, we see different distributions in last letters between male and female letters. For females, 'a', 'e', 'y', and 'n' were mostly popular throughout the years, and the males matched in 'n'. An argument could be made for 'e' and 'y' for males. But we do see an unusual surge in the last letter of 'n' for male names in 2010. Male last letters were more even distributed across the alphabet, compared to female last letters.



Select Letters Throughout the Years


In the above analysis we examined all the letters and picked a spread of years to analyze the last letters. Here, we flip it around, and look at all the years with selected letters. Since we saw some wild fluctuations in proportion over the 60 year intervals, and looking closer at each year will shed light on the rise and fall of last letters, such as the male last letter 'n'.

Code:
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
# normalize by year and sex
# subset last letters of boy names

In [25]: letter_prop = table/table.sum().astype(float)

In [26]: dny_ts = letter_prop.ix[['d','n','y'],'M'].T

In [29]: dny_ts.head()
Out[29]:
             d         n         y
year
1880  0.083055  0.153213  0.075760
1881  0.083247  0.153214  0.077451
1882  0.085340  0.149560  0.077537
1883  0.084066  0.151646  0.079144
1884  0.086120  0.149915  0.080405

In [30]: dny_ts.plot(title='Selected last letters of male baby names')

Out[30]: <matplotlib.axes.AxesSubplot at 0xf3d1690>

# last letter female names

In [34]: lny_ts = letter_prop.ix[['l','n','y'],'F'].T

In [35]: lny_ts.plot(title='Selected last letters of female baby names')
Out[35]: <matplotlib.axes.AxesSubplot at 0xf3aa810>

Switching back to the table with all the years, we apply the proportion conversion into a new table, appropriately named 'letter_prop'. Then select your letters- I selected 'd', 'n', 'y' for males, making sure to include the suspicious letter 'n'. To get the years to the rows, we transpose the table with '.T', and plot. Simple, right?


As expected, 'd' and 'y' both peaked (1938, and 1960, respectively), then fell. The last letter 'n' on the other hand, remained steady and started to rise dramatically after 1960, to surpass 36% of all last letters in male baby names.


Figure 2. Male Last Letter Proportion (d,n,y)

For female names, I selected 'l', 'n', and 'y', and you can select which ones you prefer, and can be more than three. We can see a clearer picture than in Figure 1, where we only had 3 years to see the proportions of all the letters. It appears that 'y' as a last letter in a female name, along with 'n', both peaked twice. Compare those two to 'l', 'l' has a relatively stable representation throughout the years, with modest peaks in 1897, 1945, and 1983.



Figure 3. Female Last Letter Proportion (l,n,y)


Name Flipping


Names such as Jamie, Leslie, or Madison come to mind when thinking of names given to boys and girls. Has a name ever switched predominately belonging to one sex to the other? Let's look at the name, Leslie and variants of the spelling. Here we will use the 'top1000' data set.

Obtain the set of the names by using '.unique()'. To find all the variants of Leslie, iterate through the list of 'all_names' for 'lesl', and index it through 'all_names' to see what it picked up. Leslie, Lesley, Leslee, Lesli, and Lesly are all variants in the data, of the name Leslie. Now we take this array and index the name variable of 'top1000', finding any matching names in the array with '.isin()'. 

In DataFrame 'filtered', we have all the years with only the qualifying Leslie name variants in the name variable. Grouping by name, make a quick table with '.sum()' to see Leslie (370,429) as the most common spelling, followed by Lesley (35,022) and Lesly (10,067).

Code:
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
# boy names that became girl names and reverse

In [36]: all_names = top1000.name.unique()

In [37]: mask = np.array(['lesl' in x.lower() for x in all_names])

In [39]: lesley_like = all_names[mask]

In [40]: lesley_like
Out[40]: array(['Leslie', 'Lesley', 'Leslee', 'Lesli', 'Lesly'], dtype=object)

In [41]: filtered = top1000[top1000.name.isin(lesley_like)]

In [42]: filtered.groupby('name').births.sum()
Out[42]:
name
Leslee      1082
Lesley     35022
Lesli        929
Leslie    370429
Lesly      10067
Name: births, dtype: int64

In [43]: table = filtered.pivot_table('births', rows='year', \
   ....: cols='sex', aggfunc=sum)

In [44]: table = table.div(table.sum(1), axis=0)

In [45]: table.tail()
Out[45]:
sex   F   M
year
2006  1 NaN
2007  1 NaN
2008  1 NaN
2009  1 NaN
2010  1 NaN

In [46]: table.plot(style={'M': 'k-', 'F': 'k--'})
Out[46]: <matplotlib.axes.AxesSubplot at 0xf3e9250>

To format the data for a plot, create a pivot table of the birth sum values with the years as rows, and columns as sex. Again, to normalize the values, we divide the values by the sum of each row, thereby finding the percentage of male and female for each year of each Leslie name variant. Using '.tail()' to print the last 5 values, we see that from 2006 to 2010, Leslie & Friends were all assigned to females. Has it changed historically? Specify the male and female line styles, and plot it to find out!


We discover that around 1955, parents began to name their more female babies Leslie and its name variants. What started out as a name more common to males in the 1880's to 1940's, the trend began to reverse. Today and recently, more females are named Leslie, Lesley, or Lesly, etc. than male babies.


Figure 4. Leslie and Variants in Female and Male Baby Names


Hope you guys enjoyed this Python post about baby names! Here we examined some in depth metrics, such as the last letter in each name. It takes some creativity to wrangle with data in order to discover trending insights. What else can you measure from the names and birth data? Feel free to comment below.



Thanks for reading,

Wayne
@beyondvalence
LinkedIn

Python and Pandas Series:
1. Python and Pandas: Part 1: bit.ly and Time Zones
2. Python and Pandas: Part 2. Movie Ratings
3. Python and Pandas: Part 3. Baby Names, 1880-2010
4. Python and Pandas: Part 4. More Baby Names
.