Monday, September 15, 2014

Python and Pandas: Part 4. More Baby Names


Hello Readers,

This post continues directly from exploring baby names in Part 3 of the Python and Pandas Series. In 2007, Laura Wattenberg of babynamewizard.com discovered a peculiar trend in baby names, specifically in the last letters of newborns' names. We will extend our work with the same U.S. Social Security Administration data from Part 3 to uncover and verify the same trend. We will also search for names that have switched from being given mostly to girls to being given mostly to boys, and the reverse.

Check out Part 3 for the code and information on downloading, loading, and (re)formatting the data. Here in Part 4, we will use both the 'top1000' and original 'names' DataFrames. That is why we pickled the DataFrames previously: I did suggest that it was a good idea to save your progress (hint hint)! Start Python and let's begin.
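If you are following along on a newer pandas release, note that the 'pd.load' call used in the session below is no longer available; 'DataFrame.to_pickle' and 'pd.read_pickle' are the current pair. Here is a minimal sketch of the save and restore round trip. The tiny DataFrame and the temporary file path are made up for illustration and are not the Part 3 data.

```python
import os
import tempfile

import pandas as pd

# Made-up stand-in for a DataFrame built in Part 3 (not real SSA counts).
frame = pd.DataFrame({"name": ["Mary", "John"], "births": [10, 20]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "names.pkl")
    frame.to_pickle(path)            # save progress to disk
    restored = pd.read_pickle(path)  # load it back in a later session

print(restored.equals(frame))
```

In your own session you would point 'pd.read_pickle' at the file you saved in Part 3, for example 'names.pkl'.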


Isolating the Last Letter


Since we want to include all the names, we will pull the last letters from the main 'names' DataFrame (rhyming points). Import 'pylab', 'pandas', and 'numpy'. Use an anonymous function, lambda, to create a quick, custom function that returns the last letter of a name, then apply it to every name with '.map()'.

Code:
# extract last letters in names

C:\Users\wayne>cd .\Documents\python\dataAnalysis\git\ch02

C:\Users\wayne\Documents\python\dataAnalysis\git\ch02>ipython --matplotlib
Python 2.7.8 (default, Jun 30 2014, 16:03:49) [MSC v.1500 32 bit (Intel)]
Type "copyright", "credits" or "license" for more information.

IPython 2.1.0 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.
Using matplotlib backend: Qt4Agg

In [1]: import pylab

In [2]: import pandas as pd

In [3]: import numpy as np

In [4]: names = pd.load('names.pkl')

# use lambda function

In [5]: get_last_letter = lambda x: x[-1]

In [7]: last_letters = names.name.map(get_last_letter)

In [9]: last_letters.name = 'last_letter'

# create pivot table last_letters in rows

In [10]: table = names.pivot_table('births', rows=last_letters, \
   ....: cols=['sex', 'year'], aggfunc=sum)
   
# subset only certain years
   
In [13]: subtable = table.reindex(columns=[1910,1960,2010], level='year')

In [14]: subtable.head()
Out[14]:
sex               F                      M
year           1910    1960    2010   1910    1960    2010
last_letter
a            108376  691247  670605    977    5204   28438
b               NaN     694     450    411    3912   38859
c                 5      49     946    482   15476   23125
d              6750    3729    2607  22111  262112   44398
e            133569  435013  313833  28655  178823  129012

With 'last_letters' in tow, create a pivot table with 'births' as the values and 'last_letters' as the rows to see a different table than we have seen previously. Instead of an aggregation of names, we have an aggregation of last letters, by sex and year. Subset this table to three years, 1910, 1960, and 2010, to eyeball the trend quickly.
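The session above was recorded with a 2014 pandas release. In later releases the 'rows' and 'cols' keywords of 'pivot_table' were renamed 'index' and 'columns'. The sketch below shows the same last-letter pivot with the newer keywords. The six rows are made up for illustration and are not real SSA counts, and storing the letters in a 'last_letter' column is my own simplification.

```python
import pandas as pd

# Made-up stand-in for the 'names' DataFrame (not real SSA counts).
names = pd.DataFrame({
    "name":   ["Mary", "Anna", "John", "Ethan", "Mary", "Ethan"],
    "sex":    ["F", "F", "M", "M", "F", "M"],
    "births": [100, 50, 80, 20, 60, 90],
    "year":   [1910, 1910, 1910, 1910, 2010, 2010],
})

# Same idea as the lambda with .map(): take the last character of each name.
names = names.assign(last_letter=names["name"].map(lambda x: x[-1]))

# 'index' and 'columns' replace the older 'rows' and 'cols' keywords.
table = names.pivot_table("births", index="last_letter",
                          columns=["sex", "year"], aggfunc="sum")
print(table)
```

Letter and group combinations with no births show up as NaN, just like the 'b' row for females in 1910 above.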



Last Letter Proportion


Now that we have the births by last letter, year, and sex in a pivot table, we can go ahead and plot the data. Before we do so, we should normalize the data so we can compare the births between the sex and year groupings. Taking the '.sum()' of the 'subtable' gives the total births in each sex and year grouping; dividing each count by its group total yields the proportion.


Code:
# create proportion out of total in each year and sex

In [15]: subtable.sum()
Out[15]:
sex  year
F    1910     396416
     1960    2022062
     2010    1759010
M    1910     194198
     1960    2132588
     2010    1898382
dtype: float64

In [16]: letter_prop = subtable / subtable.sum().astype(float)

In [17]: import matplotlib.pyplot as plt

In [19]: fig, axes = plt.subplots(2, 1, figsize=(10,8))

In [20]: letter_prop['M'].plot(kind='bar', rot=0, ax=axes[0], title='Male')
Out[20]: <matplotlib.axes.AxesSubplot at 0x83ca3d0>

In [21]: letter_prop['F'].plot(kind='bar', rot=0, ax=axes[1], title='Female', legend=False)
Out[21]: <matplotlib.axes.AxesSubplot at 0xf050170>

Only then can we plot the two figures, one for male and one for female names, with the letters on the x-axis and the proportion on the y-axis, using a different colored bar for each of the 3 years. So here we will explore how all the letters compare at 50-year intervals across a century. Note in inputs 20 and 21 that we add the plots to the two blank plot spaces separately, which gives us Figure 1.
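To see why the division normalizes each grouping, here is a small sketch with made-up counts (not real SSA data) standing in for 'subtable'. Dividing a DataFrame by the Series returned from '.sum()' lines the totals up with the matching (sex, year) columns, so every column ends up summing to 1.

```python
import pandas as pd

# Made-up counts: rows are last letters, columns are (sex, year) groups.
columns = pd.MultiIndex.from_product([["F", "M"], [1910, 2010]],
                                     names=["sex", "year"])
subtable = pd.DataFrame(
    [[30.0, 60.0, 10.0, 20.0],
     [70.0, 40.0, 90.0, 80.0]],
    index=pd.Index(["a", "n"], name="last_letter"),
    columns=columns,
)

# Each count is divided by the total of its own (sex, year) column.
letter_prop = subtable / subtable.sum()

print(letter_prop)
print(letter_prop.sum())  # every group now sums to 1
```

Because each column is scaled by its own total, a year with few recorded births can be compared fairly with a year that has many.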

Figure 1. Male and Female Baby Name Last Letter Proportions

The legend is the same for both male and female plots, so only one is required. From the three selected years 1910, 1960, and 2010, we see different distributions of last letters between male and female names. For females, 'a', 'e', 'y', and 'n' were the most popular throughout the years, and the males matched in 'n'. An argument could be made for 'e' and 'y' for males. But we do see an unusual surge in the last letter 'n' for male names in 2010. Male last letters were more evenly distributed across the alphabet than female last letters.



Select Letters Throughout the Years


In the above analysis we examined all the letters and picked a spread of years to analyze the last letters. Here, we flip it around and look at all the years with selected letters. We saw some wild fluctuations in proportion across the 50-year intervals, so looking closer at each year will shed light on the rise and fall of last letters, such as the male last letter 'n'.

Code:
# normalize by year and sex
# subset last letters of boy names

In [25]: letter_prop = table/table.sum().astype(float)

In [26]: dny_ts = letter_prop.ix[['d','n','y'],'M'].T

In [29]: dny_ts.head()
Out[29]:
             d         n         y
year
1880  0.083055  0.153213  0.075760
1881  0.083247  0.153214  0.077451
1882  0.085340  0.149560  0.077537
1883  0.084066  0.151646  0.079144
1884  0.086120  0.149915  0.080405

In [30]: dny_ts.plot(title='Selected last letters of male baby names')

Out[30]: <matplotlib.axes.AxesSubplot at 0xf3d1690>

# last letter female names

In [34]: lny_ts = letter_prop.ix[['l','n','y'],'F'].T

In [35]: lny_ts.plot(title='Selected last letters of female baby names')
Out[35]: <matplotlib.axes.AxesSubplot at 0xf3aa810>

Switching back to the table with all the years, we apply the proportion conversion to create a new table, appropriately named 'letter_prop'. Then select your letters. I selected 'd', 'n', and 'y' for males, making sure to include the suspicious letter 'n'. To get the years into the rows, we transpose the table with '.T', and plot. Simple, right?
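One note if you run this on a recent pandas release: the '.ix' indexer used above was removed in pandas 1.0, and label-based selection is done with '.loc' instead. The sketch below repeats the select-and-transpose step with '.loc' on a made-up proportion table (the values are invented, not real SSA proportions).

```python
import pandas as pd

# Made-up proportions: rows are last letters, columns are (sex, year).
columns = pd.MultiIndex.from_product([["F", "M"], [1880, 1881]],
                                     names=["sex", "year"])
letter_prop = pd.DataFrame(
    [[0.4, 0.4, 0.1, 0.1],
     [0.1, 0.1, 0.2, 0.2],
     [0.2, 0.2, 0.3, 0.4],
     [0.3, 0.3, 0.4, 0.3]],
    index=pd.Index(["a", "d", "n", "y"], name="last_letter"),
    columns=columns,
)

# Pick the rows for 'd', 'n', 'y' and the male columns, then transpose
# so that each row is a year and each column is a letter.
dny_ts = letter_prop.loc[["d", "n", "y"], "M"].T

print(dny_ts)
```

Selecting 'M' on the outer column level drops that level, which is why the transposed result is indexed by year alone.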


As expected, 'd' and 'y' both peaked (in 1938 and 1960, respectively), then fell. The last letter 'n', on the other hand, remained steady until 1960 and then rose dramatically to surpass 36% of all last letters in male baby names.


Figure 2. Male Last Letter Proportion (d,n,y)

For female names, I selected 'l', 'n', and 'y'. You can select whichever letters you prefer, and you are not limited to three. We can see a clearer picture than in Figure 1, where we only had 3 years to see the proportions of all the letters. It appears that 'y' as a last letter in a female name, along with 'n', peaked twice. Compared to those two, 'l' has a relatively stable representation throughout the years, with modest peaks in 1897, 1945, and 1983.



Figure 3. Female Last Letter Proportion (l,n,y)


Name Flipping


Names such as Jamie, Leslie, or Madison come to mind when thinking of names given to both boys and girls. Has a name ever switched from predominantly belonging to one sex to the other? Let's look at the name Leslie and variants of its spelling. Here we will use the 'top1000' data set.

Obtain the set of unique names by using '.unique()'. To find all the variants of Leslie, iterate through 'all_names' checking each lowercased name for 'lesl' to build a boolean mask, then index 'all_names' with the mask to see what it picked up. Leslie, Lesley, Leslee, Lesli, and Lesly are all variants of the name Leslie in the data. Now we take this array and filter 'top1000' on its name variable, keeping any rows whose name matches one in the array with '.isin()'.

In DataFrame 'filtered', we have all the years with only the qualifying Leslie name variants in the name variable. Grouping by name, make a quick table with '.sum()' to see Leslie (370,429) as the most common spelling, followed by Lesley (35,022) and Lesly (10,067).

Code:
# boy names that became girl names and reverse

In [36]: all_names = top1000.name.unique()

In [37]: mask = np.array(['lesl' in x.lower() for x in all_names])

In [39]: lesley_like = all_names[mask]

In [40]: lesley_like
Out[40]: array(['Leslie', 'Lesley', 'Leslee', 'Lesli', 'Lesly'], dtype=object)

In [41]: filtered = top1000[top1000.name.isin(lesley_like)]

In [42]: filtered.groupby('name').births.sum()
Out[42]:
name
Leslee      1082
Lesley     35022
Lesli        929
Leslie    370429
Lesly      10067
Name: births, dtype: int64

In [43]: table = filtered.pivot_table('births', rows='year', \
   ....: cols='sex', aggfunc=sum)

In [44]: table = table.div(table.sum(1), axis=0)

In [45]: table.tail()
Out[45]:
sex   F   M
year
2006  1 NaN
2007  1 NaN
2008  1 NaN
2009  1 NaN
2010  1 NaN

In [46]: table.plot(style={'M': 'k-', 'F': 'k--'})
Out[46]: <matplotlib.axes.AxesSubplot at 0xf3e9250>

To format the data for a plot, create a pivot table of the summed births with the years as rows and sex as columns. Again, to normalize the values, we divide them by the sum of each row, thereby finding the proportion of male and female births for each year across all the Leslie name variants combined. Using '.tail()' to print the last 5 rows, we see that from 2006 to 2010, Leslie & Friends were all assigned to females. Has it changed historically? Specify the male and female line styles, and plot it to find out!
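For readers on a current pandas release, here is a rough sketch of the whole name flipping pipeline using the newer 'index' and 'columns' keywords. The rows are made up for illustration (not real SSA counts), and using '.str.contains()' in place of the list comprehension is my own substitution.

```python
import pandas as pd

# Made-up rows standing in for 'top1000' (not real SSA counts).
top1000 = pd.DataFrame({
    "name":   ["Leslie", "Leslie", "Lesley", "Leslie", "Lesly", "John"],
    "sex":    ["M", "F", "F", "F", "F", "M"],
    "births": [80, 10, 10, 75, 25, 500],
    "year":   [1900, 1900, 1900, 2000, 2000, 2000],
})

# Find every unique name containing 'lesl', ignoring case.
all_names = pd.Series(top1000["name"].unique())
lesley_like = all_names[all_names.str.lower().str.contains("lesl")]

# Keep only the rows whose name is one of the variants.
filtered = top1000[top1000["name"].isin(lesley_like)]

# Births by year and sex, then divide each row by its total.
table = filtered.pivot_table("births", index="year",
                             columns="sex", aggfunc="sum")
table = table.div(table.sum(axis=1), axis=0)

print(table)
```

A year with no male births for any variant shows NaN in the 'M' column and 1 in the 'F' column, which matches the pattern in the '.tail()' output above.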


We discover that around 1955, parents began to name more of their female babies Leslie and its variants. The name started out as more common for males from the 1880s to the 1940s, and then the trend began to reverse. In recent years, more female than male babies have been named Leslie, Lesley, Lesly, and so on.


Figure 4. Leslie and Variants in Female and Male Baby Names


Hope you guys enjoyed this Python post about baby names! Here we examined some in depth metrics, such as the last letter in each name. It takes some creativity to wrangle with data in order to discover trending insights. What else can you measure from the names and birth data? Feel free to comment below.



Thanks for reading,

Wayne
@beyondvalence
LinkedIn

Python and Pandas Series:
1. Python and Pandas: Part 1: bit.ly and Time Zones
2. Python and Pandas: Part 2. Movie Ratings
3. Python and Pandas: Part 3. Baby Names, 1880-2010
4. Python and Pandas: Part 4. More Baby Names