Loading...

Saturday, September 13, 2014

Python and Pandas: Part 3. Baby Names, 1880-2010


Hello Readers,

Here in the third part of the Python and Pandas series, we analyze over 1.6 million baby name records from the United States Social Security Administration from 1880 to 2010. A particular name must have at least 5 occurrences for inclusion into the data set. We will explore reading in multiple raw data files, merging them into one DataFrame, subsetting desired portions of the data, creating new variable metrics, and visualizing results.

As usual, start IPython in your command prompt if you want to follow along. You can find the data here, under National data (it unzips to 'names' folder). Let's jump in.


Data Preview


The baby name files are split by year of birth, all in a similar format: 'yob1880.txt', 'yob1881.txt', and so on to 'yob2010.txt'. You can go ahead and import 'pandas', 'pylab', and 'numpy' modules now or when they required later.

Use the '.read_csv()' method to access the first text file of baby names in 1880. We see that there were 2,000 boy and girl names from the data that year (n>=5), with 3 variables: the name, the sex of the baby, and the birth count for that name.

Code:
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
C:\Users\wayne>cd .\Documents\python\dataAnalysis\git\ch02

C:\Users\wayne\Documents\python\dataAnalysis\git\ch02>ipython --matplotlib
Python 2.7.8 (default, Jun 30 2014, 16:03:49) [MSC v.1500 32 bit (Intel)]
Type "copyright", "credits" or "license" for more information.

IPython 2.1.0 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.
Using matplotlib backend: Qt4Agg

In [1]: import pandas as pd

In [2]: names1880 = pd.read_csv('names\yob1880.txt',names=['name','sex','births'])

In [3]: names1880
Out[3]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2000 entries, 0 to 1999
Data columns (total 3 columns):
name      2000  non-null values
sex       2000  non-null values
births    2000  non-null values
dtypes: int64(1), object(2)

In [4]: names1880.groupby('sex').births.sum()
Out[4]:
sex
F       90993
M      110493
Name: births, dtype: int64

Performing a quick tab, we group the data by 'sex' and view the count of 'births'. There are 90,993 girls, and 110,493 boys in the 1880 data.



Assembling the Data


Now that we have an idea of the data contents, and we know the pattern of the text file names, we can create a loop to read in the data. At the same time, we add another variable denoting the year for a particular name entry for when all the years are together.

Create a 'years' variable which we will use to iterate through each year text file. Then we read the data, add a 'year' column, append the data to DataFrame 'pieces', then merge them together. Using the '%d' string formatter, we can replace that space with a given variable, 'year'. After using the '.append()' method to add the current 'frame' object to 'pieces', we take advantage of '.concat()' to merge the frames in 'pieces' by row for a completed DataFrame in 'names'. 'ignore_index' should be True because we do not want to keep the original indexes.


Code:
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
# data split by year
# so assemble data into single DataFrame and add year field
# use %d string formatter to iterate through the years
# add 'year' column
# append pieces together
# .concat merges by row and do not preserve original row numbers

In [9]: pieces = []

In [10]: columns = ['name', 'sex', 'births']

In [11]: years = range(1880, 2011)

In [12]: for year in years:
   ....:     path = 'names/yob%d.txt' % year
   ....:     frame = pd.read_csv(path, names=columns)
   ....:     frame['year'] = year
   ....:     pieces.append(frame)
   ....:     names = pd.concat(pieces, ignore_index=True)
   ....:

In [13]: names
Out[13]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1690784 entries, 0 to 1690783
Data columns (total 4 columns):
name      1690784  non-null values
sex       1690784  non-null values
births    1690784  non-null values
year      1690784  non-null values
dtypes: int64(2), object(2)

# 1,690,783 rows of data with 4 columns

In [14]: names.save('names.pkl')

In the 'names' DataFrame, we have 1,690,783 names from years 1880 to 2010 with 4 columns, including the year. Remember to pickle the DataFrame with '.save()', in other words, save it. It is a hefty file, around 63 MB in size, but Python will do all the heavy lifting!



Exploring the Data


First off, a pivot table is in order. Let's move the 'sex' to the columns, and the 'year' in the rows, while positioning the 'births' in values. Calling '.tail()' will give us the last 5 rows in the table. To get a bigger picture, plot the table of births stratified by sex and year.

Code:
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
In [17]: total_births = names.pivot_table('births', rows='year', cols='sex', aggfunc=sum)

In [19]: total_births.tail()
Out[19]:
sex         F        M
year
2006  1896468  2050234
2007  1916888  2069242
2008  1883645  2032310
2009  1827643  1973359
2010  1759010  1898382

In [20]: total_births.plot(title='Total Births by sex and year')
Out[20]: <matplotlib.axes.AxesSubplot at 0x1a5f4730>

Figure 1. Total births by sex and year
We can observe the birth trends rise and fall based on economic trends- they births tend to fall in times of recession, and male births started to outpace female births after WWII.


Adding Proportion and Subsetting Top 1000 Names


Here we add the column of proportions to our 'names' DataFrame. The proportions will be the number of births out of each total births grouped by year and sex. We define a new method, 'add_prop()' and convert the argument value to a float type for non-integer division purposes. Then we divide the births by the sum of births in the grouping, and return the number.

Code:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
# group by year and sex
# add proportion of babies given certain name relative to number of births

In [23]: def add_prop(group):
   ....:     # integer division floors
   ....:     births = group.births.astype(float)
   ....:     group['prop'] = births / births.sum()
   ....:     return group
   ....:

In [24]: names = names.groupby(['year','sex']).apply(add_prop)

In [25]: names
Out[25]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1690784 entries, 0 to 1690783
Data columns (total 5 columns):
name      1690784  non-null values
sex       1690784  non-null values
births    1690784  non-null values
year      1690784  non-null values
prop      1690784  non-null values
dtypes: float64(1), int64(2), object(2)

# check to see values by group sum to 1

In [28]: import numpy as np

In [29]: np.allclose(names.groupby(['year','sex']).prop.sum(),1)
Out[29]: True

# subset top 1000 births

In [30]: def get_top1000(group):
   ....:     return group.sort_index(by='births', ascending=False)[:1000]
   ....:

In [31]: grouped = names.groupby(['year','sex'])

In [32]: top1000 = grouped.apply(get_top1000)

In [37]: top1000
Out[37]:
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 261877 entries, (1880, F, 0) to (2010, M, 1677643)
Data columns (total 5 columns):
name      261877  non-null values
sex       261877  non-null values
births    261877  non-null values
year      261877  non-null values
prop      261877  non-null values
dtypes: float64(1), int64(2), object(2)

We pass the groups by year and sex to the 'add_prop()' method using '.apply()'. Confirming the new 'prop' column, the new 'names' DataFrame now has 5 columns. To ensure the birth proportions by groups are accurate, we verify using '.allclose()' method in the 'numpy' module, and compare the sum to 1. Python returns 'True', and we are assured the column values are correct.


With this new DataFrame, we now will subset the top 1000 names by birth in each year and sex grouping. Define a new method, 'get_top1000()', and which sorts the births in descending order, and returns the first 1000 entries. We pass the 'names' DataFrame grouped by year and sex to the 'get_top1000()' method into our new DataFrame, 'top1000'. Instead of over 1.6 million entries, we now have 261,877 entries with which to work.


Some Naming Trends


Because the data include information spanning from 1880 to 2010, we can examine trends in time of baby names for any changes. As usual, remember to save and pickle the 'top1000' DataFrame as we go along the analysis. Begin by separating sex into two different DataFrames for later use. 

Now create a pivot table from 'top1000', with births as summed values, years in rows, and names in the columns. There are 131 rows, one for each year and 6,865 columns, or names. We will subset by column, take only specific names, and plot the births for the selected names by year in a single plot. You can choose different names, and I chose John, Harry, Mary, and Marilyn as sample names.

Code:
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
# analyze naming trends

In [38]: top1000.save('top1000.pkl')

In [39]: boys = top1000[top1000.sex == 'M']

In [40]: girls = top1000[top1000.sex == 'F']

In [41]: total_births = top1000.pivot_table('births', rows='year', cols='name', aggfunc=sum)

In [42]: total_births
Out[42]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 131 entries, 1880 to 2010
Columns: 6865 entries, Aaden to Zuri
dtypes: float64(6865)

In [43]: subset = total_births[['John','Harry','Mary','Marilyn']]

In [46]: subset.plot(subplots=True, figsize=(12,10),grid=False,title='Number of births per year
')
Out[46]:
array([<matplotlib.axes.AxesSubplot object at 0x1A7144D0>,
       <matplotlib.axes.AxesSubplot object at 0x14C13E30>,
       <matplotlib.axes.AxesSubplot object at 0x1AA527F0>,
       <matplotlib.axes.AxesSubplot object at 0x172A05B0>], dtype=object)

# plot 'Wayne'
    
In [47]: subsetw = total_births[['Wayne']]

In [48]: subsetw.plot(title='Wayne')
Out[48]: <matplotlib.axes.AxesSubplot at 0x14c0f2f0>

# names growing out of favor?

Figure 2. Number of Births per Year, Selected Names
For John, Harry, and Mary, they have bimodal peaks about 1920s and 1950s. For Marilyn, the name became steadily popular from the 1930's to the late 1950's. For all four names, we observe a fall in births per year. Are those names really becoming more uncommon? We will discover what is happening in the data below.

Curious, I plotted my name to see the birth time trends of 'Wayne'. It follows the same rise, peak, and fall around the 1950's, though it followed less of a bimodal distribution.


Figure 3. Number of Births per Year, For Name: Wayne

Baby Name Diversity


First, a spoiler: the drop in births for certain names have something to do with the name diversity- what parents choose to name their child. The trend changes from 1950's onwards. To examine this change, we turn to the variable we created earlier, the proportion of births in each group by year and sex. So we create a pivot table from the 'top1000', but this time with the sum values as 'prop', 'year' as rows, and 'sex' as columns.

This will allow us to plot Figure 4. Note how proportion total starts at 1.0 in 1880, and slowly drops in 1960 for females and in 1970 for males. The decline in proportion of births accounted by the top 1000 names has declined to around 74% for females and 85% for males by 2010. That means the share of births for other names outside of the top 1000 has risen. More parents are choosing different, more uncommon names to call their newborns.


Figure 4. Proportion of Top Births by Year and Sex
We can check this by examining the boys and girls DataFrames we created earlier. We subset the year 2010, sort by proportion in descending order, then take the cumulative sum of the proportions of births. Taking the first 10 names, we see that the top name were roughly 1.15% of the total male births in 2010. Using '.searchsorted(0.5)' to find the sorted index of the 50th percentile, Python returns 116. Therefore 117 names consist of 50% of the male births in 2010. We compare this number to the 50% percentile of male births in 1880, which is 25. From 1880 to 2010, the number of names in the top 50% percentile of male births increased over 350% from 25 to 117. Male name diversity sure increased over the years.

Code:
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
# exploring increases in naming diversity
# fewer parents choosing common names for children

In [15]: table = top1000.pivot_table('prop', rows=
   ....: 'year',cols='sex', aggfunc=sum)

In [17]: table.plot(title='Sum of table1000.prop by year and sex', \
   ....: yticks=np.linspace(0,1.2,13), xticks=range(1880,2020,10))
Out[17]: <matplotlib.axes.AxesSubplot at 0x6a1ead0>

# names proportion going down from 1 from top 1000 names

In [20]: df = boys[boys.year==2010]

In [21]: df
Out[21]:
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 1000 entries, (2010, M, 1676644) to (2010, M, 1677643)
Data columns (total 5 columns):
name      1000  non-null values
sex       1000  non-null values
births    1000  non-null values
year      1000  non-null values
prop      1000  non-null values
dtypes: float64(1), int64(2), object(2)

In [22]: prop_cumsum = df.sort_index(by='prop', ascending=False).prop.cumsum()

In [23]: prop_cumsum[:10]
Out[23]:
year  sex
2010  M    1676644    0.011523
           1676645    0.020934
           1676646    0.029959
           1676647    0.038930
           1676648    0.047817
           1676649    0.056579
           1676650    0.065155
           1676651    0.073414
           1676652    0.081528
           1676653    0.089621
dtype: float64

In [24]: prop_cumsum.searchsorted(0.5)
Out[24]: 116

# index 116, so 117 names in top 50% in 2010

In [25]: df1900 = boys[boys.year==1900]

In [26]: prop1900 = df1900.sort_index(by='prop', ascending=False).prop.cumsum()

In [27]: prop1900.searchsorted(0.5)+1
Out[27]: 25

# in 1900, top 50% of names covered with 25 names
# so there is a large increase in name diversity

In [28]: def get_quantile(group, q=0.5):
   ....:     group = group.sort_index(by='prop', ascending=False)
   ....:     return group.prop.cumsum().searchsorted(q)+1
   ....:

In [29]: diversity = top1000.groupby(['year','sex']).apply(get_quantile)

In [30]: diversity = diversity.unstack('sex')

In [31]: diversity.head()
Out[31]:
sex    F   M
year
1880  38  14
1881  38  14
1882  38  15
1883  39  15
1884  39  16

In [32]: diversity.plot(title='Number of popular names in top 50%')
Out[32]: <matplotlib.axes.AxesSubplot at 0x2075c1f0>

This diversity increase can be said for female births as well. Rather than take the number of names in the top 50th percentile for 1880 and 2010, we calculate them for all the years, and both male and female names. Define a new function, 'get_quantile(group, q=0.5)', which sorts the 'group' argument by variable 'prop' in descending order, and returns the index of the sorted cumulative sum value at 0.5, adding 1 at the end to account for the index. 


Pass this method through the top 1000 names grouped by year and sex into the 'diversity' DataFrame. Reconfigure the DataFrame table by placing 'sex' in the columns with '.unstack()' to finalize the table. Take a peak at the first 5 years of the 'diversity' data with '.head()', and observe that 38 female names and 14 male names accounted for 50% of the top births in 1880. Lastly we plot the 'diversity' DataFrame, shown below.


Figure 5. Popular Baby Names in 50% percentile

We see a distinct increase in name diversity around 1985 for both males and females. Historically, female names were more diverse than male names. By 2010, the number of top female names accounting for the top 50 birth percentile more than doubled the male name counterpart.

Again, we reach the end of another lengthy, but I hope, enjoyable post in Python and Pandas concerning baby names. We explored and manipulated a dataset of 1.6 million rows, re-organized DataFrames, created new variables, and visualized various name metrics, all after accessing data split into 131 text files. There is more on baby names we will explore in Part B of this post. So stay tuned for more!


Thanks for reading,

Wayne
@beyondvalence
LinkedIn

Python and Pandas Series:
1. Python and Pandas: Part 1: bit.ly and Time Zones
2. Python and Pandas: Part 2. Movie Ratings
3. Python and Pandas: Part 3. Baby Names, 1880-2010
4. Python and Pandas: Part 4. More Baby Names
.

4 comments:

  1. I like your blog, I read this blog please update more content on python, further check it once at python online training

    ReplyDelete
  2. www.bolavita.fun situs Judi Online Deposit via Go Pay !

    Terbukti aman, dan sudah terpercaya, Minimal Deposit 50ribu ...

    Tersedia Pasaran Lengkap seperti SBOBET - MAXBET - CBET

    Informasi selengkapnya hubungi :
    WA : +62812-2222-995
    BBM : BOLAVITA

    Keluaran Togel Kuala Lumpur Hari Ini terbaru 2019

    ReplyDelete