Hello Readers,
This post continues directly from exploring baby names in Part 3 of the Python and Pandas Series. In 2007, Laura Wattenberg of babynamewizard.com discovered a peculiar trend in baby names, specifically in the last letters of newborns' names. We will extend the same U.S. Social Security Administration data we used in Part 3 to uncover and verify the same trend. We will also search for names that have switched from being given mostly to girls to being given mostly to boys, and vice versa.
Check out Part 3 for the code and information on downloading, loading, and (re)formatting the data. Here in Part 4, we will use both the 'top1000' and original 'names' DataFrames. That is why we pickled the DataFrames previously. I did suggest that it was a good idea to save your progress (hint hint)! Start Python and let's begin.
Isolating the Last Letter
Since we want to include all the names, we will pull the last letters from the main 'names' DataFrame (rhyming points). Import 'pylab', 'pandas', and 'numpy'. Use an anonymous function, a lambda, to create a quick, custom function that pulls the last letter from each name, then apply it to the name column with '.map()'.
Code:
# extract last letters in names
C:\Users\wayne>cd .\Documents\python\dataAnalysis\git\ch02
C:\Users\wayne\Documents\python\dataAnalysis\git\ch02>ipython --matplotlib
Python 2.7.8 (default, Jun 30 2014, 16:03:49) [MSC v.1500 32 bit (Intel)]
Type "copyright", "credits" or "license" for more information.

IPython 2.1.0 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.
Using matplotlib backend: Qt4Agg

In [1]: import pylab

In [2]: import pandas as pd

In [3]: import numpy as np

In [4]: names = pd.load('names.pkl')

# use lambda function
In [5]: get_last_letter = lambda x: x[-1]

In [7]: last_letters = names.name.map(get_last_letter)

In [9]: last_letters.name = 'last_letter'

# create pivot table last_letters in rows
In [10]: table = names.pivot_table('births', rows=last_letters, \
   ....: cols=['sex', 'year'], aggfunc=sum)

# subset only certain years
In [13]: subtable = table.reindex(columns=[1910,1960,2010], level='year')

In [14]: subtable.head()
Out[14]:
sex               F                      M
year           1910    1960    2010   1910    1960    2010
last_letter
a            108376  691247  670605    977    5204   28438
b               NaN     694     450    411    3912   38859
c                 5      49     946    482   15476   23125
d              6750    3729    2607  22111  262112   44398
e            133569  435013  313833  28655  178823  129012
With 'last_letters' in tow, create a pivot table with 'births' as the values and 'last_letters' as the rows. This gives a different table than we have seen previously: instead of an aggregation by name, we have an aggregation by last letter, broken down by sex and year. Subset this table to three years, 1910, 1960, and 2010, to eyeball the trend quickly.
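The session above was recorded on an older pandas, where 'pd.load' and the 'rows='/'cols=' keywords were available. On newer pandas versions those are gone ('pd.read_pickle' and 'index='/'columns=' are the replacements), so here is a minimal sketch of the same steps in the newer spelling. The small 'names' DataFrame is made-up toy data standing in for the real SSA table, and '.str[-1]' is used as an alternative to the lambda with '.map()'.

```python
import pandas as pd

# Toy stand-in for the 'names' DataFrame (made-up counts, not SSA data).
# With the real data you would load the pickle, e.g. pd.read_pickle('names.pkl').
names = pd.DataFrame({
    "name":   ["Mary", "Anna", "John", "Ryan", "Lisa",
               "David", "Mary", "Emma", "John", "Ethan"],
    "sex":    ["F", "F", "M", "M", "F",
               "M", "F", "F", "M", "M"],
    "births": [100, 50, 80, 20, 70,
               30, 60, 90, 40, 110],
    "year":   [1910, 1910, 1910, 1910, 1960,
               1960, 2010, 2010, 2010, 2010],
})

# .str[-1] takes the last character of each name, like the lambda + .map().
last_letters = names["name"].str[-1]
last_letters.name = "last_letter"

# index=/columns= replace the older rows=/cols= keywords.
table = names.pivot_table(
    values="births", index=last_letters, columns=["sex", "year"], aggfunc="sum"
)

# Keep only selected years from the 'year' level of the column MultiIndex.
subtable = table.reindex(columns=[1910, 2010], level="year")
print(subtable)
```

Letters with no births in a given sex and year show up as NaN, just as 'b' does for females in 1910 in the real output.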
Last Letter Proportion
Now that we have the births by last letter, year, and sex in a pivot table, we can go ahead and plot the data. Before we do so, we should normalize the data so we can compare births across the sex and year groupings. Taking the '.sum()' of the 'subtable' for the three selected years yields the total births in each grouping, which we then use as the divisor for each letter's births to obtain the proportion.
Code:
# create proportion out of total in each year and sex
In [15]: subtable.sum()
Out[15]:
sex  year
F    1910     396416
     1960    2022062
     2010    1759010
M    1910     194198
     1960    2132588
     2010    1898382
dtype: float64

In [16]: letter_prop = subtable / subtable.sum().astype(float)

In [17]: import matplotlib.pyplot as plt

In [19]: fig, axes = plt.subplots(2, 1, figsize=(10,8))

In [20]: letter_prop['M'].plot(kind='bar', rot=0, ax=axes[0], title='Male')
Out[20]: <matplotlib.axes.AxesSubplot at 0x83ca3d0>

In [21]: letter_prop['F'].plot(kind='bar', rot=0, ax=axes[1], title='Female', legend=False)
Out[21]: <matplotlib.axes.AxesSubplot at 0xf050170>
Only then can we plot the two figures, one for males and one for females, with letters on the x-axis and proportion on the y-axis, and the three years shown as different colored bars. This lets us see how all the letters compare over a span of 100 years. Note that in inputs 20 and 21 we add the plots to the two blank subplots separately, which gives us Figure 1.
Figure 1. Male and Female Baby Name Last Letter Proportions
The legend is the same for both plots, so only one is required. From the three selected years, 1910, 1960, and 2010, we see different distributions of last letters between male and female names. For females, 'a', 'e', 'y', and 'n' were the most popular throughout the years, and males matched on 'n'. An argument could be made for 'e' and 'y' for males as well. But we do see an unusual surge in the last letter 'n' for male names in 2010. Male last letters were more evenly distributed across the alphabet than female last letters.
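To make the normalization behind Figure 1 concrete, here is a small sketch with made-up numbers (not the SSA counts). It assumes a table shaped like 'subtable', with last letters in the rows and (sex, year) pairs in the columns. Dividing a DataFrame by the Series of its column sums lines up on the column labels, so every column is scaled by its own total.

```python
import pandas as pd

# Made-up births by last letter (rows) and (sex, year) (columns).
columns = pd.MultiIndex.from_product(
    [["F", "M"], [1910, 2010]], names=["sex", "year"]
)
subtable = pd.DataFrame(
    [[50.0, 90.0, 10.0, 30.0],
     [25.0, 60.0, 70.0, 150.0],
     [25.0, 50.0, 20.0, 20.0]],
    index=pd.Index(["a", "n", "y"], name="last_letter"),
    columns=columns,
)

# subtable.sum() gives one total per (sex, year) column; the division
# aligns on those column labels, so each column is divided by its own sum.
letter_prop = subtable / subtable.sum()

print(letter_prop.round(3))
print(letter_prop.sum())  # every column should now total 1.0
```

Because each column sums to 1, bars from different years and sexes can be compared directly even though the total number of births differs a lot between them.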
Select Letters Throughout the Years
In the above analysis we examined all the letters and picked a spread of years. Here we flip it around and look at all the years for selected letters. Since we saw some wild fluctuations in proportion across the 50-year intervals, looking more closely at each year will shed light on the rise and fall of last letters, such as the male last letter 'n'.
Code:
# normalize by year and sex
# subset last letters of boy names
In [25]: letter_prop = table/table.sum().astype(float)

In [26]: dny_ts = letter_prop.ix[['d','n','y'],'M'].T

In [29]: dny_ts.head()
Out[29]:
             d         n         y
year
1880  0.083055  0.153213  0.075760
1881  0.083247  0.153214  0.077451
1882  0.085340  0.149560  0.077537
1883  0.084066  0.151646  0.079144
1884  0.086120  0.149915  0.080405

In [30]: dny_ts.plot(title='Selected last letters of male baby names')
Out[30]: <matplotlib.axes.AxesSubplot at 0xf3d1690>

# last letter female names
In [34]: lny_ts = letter_prop.ix[['l','n','y'],'F'].T

In [35]: lny_ts.plot(title='Selected last letters of female baby names')
Out[35]: <matplotlib.axes.AxesSubplot at 0xf3aa810>
Switching back to the table with all the years, we apply the proportion conversion to get a new table, appropriately named 'letter_prop'. Then select your letters. I selected 'd', 'n', and 'y' for males, making sure to include the suspicious letter 'n'. To move the years into the rows, we transpose the table with '.T', and plot. Simple, right?
As expected, 'd' and 'y' both peaked (in 1938 and 1960, respectively), then fell. The last letter 'n', on the other hand, remained steady and then started to rise dramatically after 1960, surpassing 36% of all last letters in male baby names.
Figure 2. Male Last Letter Proportion (d,n,y)
For female names, I selected 'l', 'n', and 'y'. You can select whichever letters you prefer, and more than three if you like. We get a clearer picture than in Figure 1, where we only had three years to see the proportions of all the letters. It appears that 'y' and 'n' as last letters in female names both peaked twice. Compared to those two, 'l' has a relatively stable representation throughout the years, with modest peaks in 1897, 1945, and 1983.
Figure 3. Female Last Letter Proportion (l,n,y)
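The '.ix' indexer used in the session has since been removed from pandas. As a sketch of the same selection on a newer version, '.loc' accepts the list of row labels together with the top-level column label. The 'letter_prop' values below are invented for illustration and cover only three years.

```python
import pandas as pd

# Invented proportions: last letters in rows, (sex, year) in columns.
columns = pd.MultiIndex.from_product(
    [["F", "M"], [1880, 1881, 1882]], names=["sex", "year"]
)
letter_prop = pd.DataFrame(
    [[0.01, 0.01, 0.01, 0.08, 0.09, 0.10],   # d
     [0.30, 0.31, 0.32, 0.15, 0.14, 0.13],   # e
     [0.05, 0.05, 0.06, 0.15, 0.16, 0.17],   # n
     [0.10, 0.11, 0.12, 0.07, 0.08, 0.09]],  # y
    index=["d", "e", "n", "y"],
    columns=columns,
)

# Pick rows d, n, y and the male columns, then transpose so that
# years become the rows and each letter becomes its own column.
dny_ts = letter_prop.loc[["d", "n", "y"], "M"].T
print(dny_ts)
```

With years in the rows, calling '.plot()' on 'dny_ts' draws one line per letter over time, which is the shape the line charts in this section need.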
Name Flipping
Names such as Jamie, Leslie, or Madison come to mind when thinking of names given to both boys and girls. Has a name ever switched from predominantly belonging to one sex to the other? Let's look at the name Leslie and its spelling variants. Here we will use the 'top1000' data set.
Obtain the set of unique names with '.unique()'. To find all the variants of Leslie, iterate through 'all_names', checking whether each lowercased name contains 'lesl', and use the resulting Boolean mask to index 'all_names' to see what it picked up. Leslie, Lesley, Leslee, Lesli, and Lesly are all the variants of Leslie in the data. Now we take this array and filter 'top1000' on the name column, keeping the rows whose name is in the array with '.isin()'.
In the DataFrame 'filtered', we have all the years, with only the qualifying Leslie variants in the name column. Grouping by name, make a quick table with '.sum()' to see Leslie (370,429) as the most common spelling, followed by Lesley (35,022) and Lesly (10,067).
Code:
# boy names that became girl names and reverse
In [36]: all_names = top1000.name.unique()

In [37]: mask = np.array(['lesl' in x.lower() for x in all_names])

In [39]: lesley_like = all_names[mask]

In [40]: lesley_like
Out[40]: array(['Leslie', 'Lesley', 'Leslee', 'Lesli', 'Lesly'], dtype=object)

In [41]: filtered = top1000[top1000.name.isin(lesley_like)]

In [42]: filtered.groupby('name').births.sum()
Out[42]:
name
Leslee      1082
Lesley     35022
Lesli        929
Leslie    370429
Lesly      10067
Name: births, dtype: int64

In [43]: table = filtered.pivot_table('births', rows='year', \
   ....: cols='sex', aggfunc=sum)

In [44]: table = table.div(table.sum(1), axis=0)

In [45]: table.tail()
Out[45]:
sex   F   M
year
2006  1 NaN
2007  1 NaN
2008  1 NaN
2009  1 NaN
2010  1 NaN

In [46]: table.plot(style={'M': 'k-', 'F': 'k--'})
Out[46]: <matplotlib.axes.AxesSubplot at 0xf3e9250>
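As an alternative to building the mask with a list comprehension, the same variant search can be sketched with the pandas string methods. The 'top1000' frame below is a tiny made-up stand-in (the counts are not real), and 'Wesley' is included only to show that a similar-looking name is not picked up.

```python
import pandas as pd

# Tiny made-up stand-in for 'top1000'.
top1000 = pd.DataFrame({
    "name":   ["Leslie", "Leslie", "Lesley", "Wesley", "Lesly", "Leslie"],
    "sex":    ["M", "F", "F", "M", "F", "F"],
    "births": [80, 20, 30, 500, 10, 160],
    "year":   [1900, 1900, 2000, 2000, 2000, 2000],
})

# Unique names as a Series, so the .str accessor is available.
all_names = pd.Series(top1000["name"].unique())

# True where the lowercased name contains the substring 'lesl'.
mask = all_names.str.lower().str.contains("lesl")
lesley_like = all_names[mask]

# Keep only the rows whose name is one of the matched variants.
filtered = top1000[top1000["name"].isin(lesley_like)]
totals = filtered.groupby("name")["births"].sum()
print(totals)
```

Matching on a substring is a blunt tool, so it is worth printing 'lesley_like' and checking the list by eye before filtering, as the session does at input 40.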
To format the data for a plot, create a pivot table of the summed births with years as the rows and sex as the columns. Again, to normalize the values, we divide each value by the sum of its row, which gives the proportion of male and female births in each year across all the Leslie variants combined. Using '.tail()' to print the last 5 rows, we see that from 2006 to 2010, Leslie & Friends were all assigned to females. Has it changed historically? Specify the male and female line styles, and plot it to find out!
We discover that around 1955, parents began to name more of their female babies Leslie and its variants. The name started out as more common for males from the 1880s to the 1940s, and then the trend reversed. In recent years, far more female babies than male babies have been named Leslie, Lesley, Lesly, etc.
Figure 4. Leslie and Variants in Female and Male Baby Names
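The row-wise normalization can be sketched on a toy version of 'filtered' (made-up counts again), using the newer 'index='/'columns=' keywords in place of 'rows='/'cols='. The key call is '.div(..., axis=0)', which divides each row by that row's total rather than each column by its total.

```python
import pandas as pd

# Toy version of 'filtered' with made-up counts.
filtered = pd.DataFrame({
    "name":   ["Leslie", "Leslie", "Lesley", "Leslie", "Lesly"],
    "sex":    ["M", "F", "F", "F", "F"],
    "births": [80, 20, 30, 160, 10],
    "year":   [1900, 1900, 2000, 2000, 2000],
})

# Births summed over all variants, by year (rows) and sex (columns).
table = filtered.pivot_table(
    values="births", index="year", columns="sex", aggfunc="sum"
)

# table.sum(axis=1) is the total per year; axis=0 in .div() matches
# those totals against the row labels, so each row is divided by its own sum.
table = table.div(table.sum(axis=1), axis=0)
print(table)
```

In this toy data the year 2000 has no male rows at all, so the male entry is NaN and the female share is 1, the same pattern the real '.tail()' output shows for 2006 to 2010.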
Hope you guys enjoyed this Python post about baby names! Here we examined some in-depth metrics, such as the last letter in each name. It takes some creativity to wrangle data in order to discover trends. What else can you measure from the names and birth data? Feel free to comment below.
Thanks for reading,
Wayne
@beyondvalence
Python and Pandas Series:
1. Python and Pandas: Part 1: bit.ly and Time Zones
2. Python and Pandas: Part 2. Movie Ratings
3. Python and Pandas: Part 3. Baby Names, 1880-2010
4. Python and Pandas: Part 4. More Baby Names