Hello Readers,
Here in the third part of the Python and Pandas series, we analyze over 1.6 million baby name records from the United States Social Security Administration from 1880 to 2010. A particular name must have at least 5 occurrences for inclusion into the data set. We will explore reading in multiple raw data files, merging them into one DataFrame, subsetting desired portions of the data, creating new variable metrics, and visualizing results.
As usual, start IPython in your command prompt if you want to follow along. You can find the data here, under National data (it unzips to 'names' folder). Let's jump in.
Data Preview
The baby name files are split by year of birth, all in a similar format: 'yob1880.txt', 'yob1881.txt', and so on to 'yob2010.txt'. You can go ahead and import 'pandas', 'pylab', and 'numpy' modules now or when they required later.
Use the '.read_csv()' method to access the first text file of baby names in 1880. We see that there were 2,000 boy and girl names from the data that year (n>=5), with 3 variables: the name, the sex of the baby, and the birth count for that name.
Code:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 | C:\Users\wayne>cd .\Documents\python\dataAnalysis\git\ch02 C:\Users\wayne\Documents\python\dataAnalysis\git\ch02>ipython --matplotlib Python 2.7.8 (default, Jun 30 2014, 16:03:49) [MSC v.1500 32 bit (Intel)] Type "copyright", "credits" or "license" for more information. IPython 2.1.0 -- An enhanced Interactive Python. ? -> Introduction and overview of IPython's features. %quickref -> Quick reference. help -> Python's own help system. object? -> Details about 'object', use 'object??' for extra details. Using matplotlib backend: Qt4Agg In [1]: import pandas as pd In [2]: names1880 = pd.read_csv('names\yob1880.txt',names=['name','sex','births']) In [3]: names1880 Out[3]: <class 'pandas.core.frame.DataFrame'> Int64Index: 2000 entries, 0 to 1999 Data columns (total 3 columns): name 2000 non-null values sex 2000 non-null values births 2000 non-null values dtypes: int64(1), object(2) In [4]: names1880.groupby('sex').births.sum() Out[4]: sex F 90993 M 110493 Name: births, dtype: int64 |
Performing a quick tab, we group the data by 'sex' and view the count of 'births'. There are 90,993 girls, and 110,493 boys in the 1880 data.
Assembling the Data
Now that we have an idea of the data contents, and we know the pattern of the text file names, we can create a loop to read in the data. At the same time, we add another variable denoting the year for a particular name entry for when all the years are together.
Create a 'years' variable which we will use to iterate through each year text file. Then we read the data, add a 'year' column, append the data to DataFrame 'pieces', then merge them together. Using the '%d' string formatter, we can replace that space with a given variable, 'year'. After using the '.append()' method to add the current 'frame' object to 'pieces', we take advantage of '.concat()' to merge the frames in 'pieces' by row for a completed DataFrame in 'names'. 'ignore_index' should be True because we do not want to keep the original indexes.
Code:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 | # data split by year # so assemble data into single DataFrame and add year field # use %d string formatter to iterate through the years # add 'year' column # append pieces together # .concat merges by row and do not preserve original row numbers In [9]: pieces = [] In [10]: columns = ['name', 'sex', 'births'] In [11]: years = range(1880, 2011) In [12]: for year in years: ....: path = 'names/yob%d.txt' % year ....: frame = pd.read_csv(path, names=columns) ....: frame['year'] = year ....: pieces.append(frame) ....: names = pd.concat(pieces, ignore_index=True) ....: In [13]: names Out[13]: <class 'pandas.core.frame.DataFrame'> Int64Index: 1690784 entries, 0 to 1690783 Data columns (total 4 columns): name 1690784 non-null values sex 1690784 non-null values births 1690784 non-null values year 1690784 non-null values dtypes: int64(2), object(2) # 1,690,783 rows of data with 4 columns In [14]: names.save('names.pkl') |
In the 'names' DataFrame, we have 1,690,783 names from years 1880 to 2010 with 4 columns, including the year. Remember to pickle the DataFrame with '.save()', in other words, save it. It is a hefty file, around 63 MB in size, but Python will do all the heavy lifting!
Exploring the Data
First off, a pivot table is in order. Let's move the 'sex' to the columns, and the 'year' in the rows, while positioning the 'births' in values. Calling '.tail()' will give us the last 5 rows in the table. To get a bigger picture, plot the table of births stratified by sex and year.
Code:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 | In [17]: total_births = names.pivot_table('births', rows='year', cols='sex', aggfunc=sum) In [19]: total_births.tail() Out[19]: sex F M year 2006 1896468 2050234 2007 1916888 2069242 2008 1883645 2032310 2009 1827643 1973359 2010 1759010 1898382 In [20]: total_births.plot(title='Total Births by sex and year') Out[20]: <matplotlib.axes.AxesSubplot at 0x1a5f4730> |
Figure 1. Total births by sex and year |
Adding Proportion and Subsetting Top 1000 Names
Here we add the column of proportions to our 'names' DataFrame. The proportions will be the number of births out of each total births grouped by year and sex. We define a new method, 'add_prop()' and convert the argument value to a float type for non-integer division purposes. Then we divide the births by the sum of births in the grouping, and return the number.
Code:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 | # group by year and sex # add proportion of babies given certain name relative to number of births In [23]: def add_prop(group): ....: # integer division floors ....: births = group.births.astype(float) ....: group['prop'] = births / births.sum() ....: return group ....: In [24]: names = names.groupby(['year','sex']).apply(add_prop) In [25]: names Out[25]: <class 'pandas.core.frame.DataFrame'> Int64Index: 1690784 entries, 0 to 1690783 Data columns (total 5 columns): name 1690784 non-null values sex 1690784 non-null values births 1690784 non-null values year 1690784 non-null values prop 1690784 non-null values dtypes: float64(1), int64(2), object(2) # check to see values by group sum to 1 In [28]: import numpy as np In [29]: np.allclose(names.groupby(['year','sex']).prop.sum(),1) Out[29]: True # subset top 1000 births In [30]: def get_top1000(group): ....: return group.sort_index(by='births', ascending=False)[:1000] ....: In [31]: grouped = names.groupby(['year','sex']) In [32]: top1000 = grouped.apply(get_top1000) In [37]: top1000 Out[37]: <class 'pandas.core.frame.DataFrame'> MultiIndex: 261877 entries, (1880, F, 0) to (2010, M, 1677643) Data columns (total 5 columns): name 261877 non-null values sex 261877 non-null values births 261877 non-null values year 261877 non-null values prop 261877 non-null values dtypes: float64(1), int64(2), object(2) |
We pass the groups by year and sex to the 'add_prop()' method using '.apply()'. Confirming the new 'prop' column, the new 'names' DataFrame now has 5 columns. To ensure the birth proportions by groups are accurate, we verify using '.allclose()' method in the 'numpy' module, and compare the sum to 1. Python returns 'True', and we are assured the column values are correct.
With this new DataFrame, we now will subset the top 1000 names by birth in each year and sex grouping. Define a new method, 'get_top1000()', and which sorts the births in descending order, and returns the first 1000 entries. We pass the 'names' DataFrame grouped by year and sex to the 'get_top1000()' method into our new DataFrame, 'top1000'. Instead of over 1.6 million entries, we now have 261,877 entries with which to work.
Some Naming Trends
Because the data include information spanning from 1880 to 2010, we can examine trends in time of baby names for any changes. As usual, remember to save and pickle the 'top1000' DataFrame as we go along the analysis. Begin by separating sex into two different DataFrames for later use.
Now create a pivot table from 'top1000', with births as summed values, years in rows, and names in the columns. There are 131 rows, one for each year and 6,865 columns, or names. We will subset by column, take only specific names, and plot the births for the selected names by year in a single plot. You can choose different names, and I chose John, Harry, Mary, and Marilyn as sample names.
Code:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 | # analyze naming trends In [38]: top1000.save('top1000.pkl') In [39]: boys = top1000[top1000.sex == 'M'] In [40]: girls = top1000[top1000.sex == 'F'] In [41]: total_births = top1000.pivot_table('births', rows='year', cols='name', aggfunc=sum) In [42]: total_births Out[42]: <class 'pandas.core.frame.DataFrame'> Int64Index: 131 entries, 1880 to 2010 Columns: 6865 entries, Aaden to Zuri dtypes: float64(6865) In [43]: subset = total_births[['John','Harry','Mary','Marilyn']] In [46]: subset.plot(subplots=True, figsize=(12,10),grid=False,title='Number of births per year ') Out[46]: array([<matplotlib.axes.AxesSubplot object at 0x1A7144D0>, <matplotlib.axes.AxesSubplot object at 0x14C13E30>, <matplotlib.axes.AxesSubplot object at 0x1AA527F0>, <matplotlib.axes.AxesSubplot object at 0x172A05B0>], dtype=object) # plot 'Wayne' In [47]: subsetw = total_births[['Wayne']] In [48]: subsetw.plot(title='Wayne') Out[48]: <matplotlib.axes.AxesSubplot at 0x14c0f2f0> # names growing out of favor? |
Figure 2. Number of Births per Year, Selected Names |
Curious, I plotted my name to see the birth time trends of 'Wayne'. It follows the same rise, peak, and fall around the 1950's, though it followed less of a bimodal distribution.
Figure 3. Number of Births per Year, For Name: Wayne |
Baby Name Diversity
First, a spoiler: the drop in births for certain names have something to do with the name diversity- what parents choose to name their child. The trend changes from 1950's onwards. To examine this change, we turn to the variable we created earlier, the proportion of births in each group by year and sex. So we create a pivot table from the 'top1000', but this time with the sum values as 'prop', 'year' as rows, and 'sex' as columns.
This will allow us to plot Figure 4. Note how proportion total starts at 1.0 in 1880, and slowly drops in 1960 for females and in 1970 for males. The decline in proportion of births accounted by the top 1000 names has declined to around 74% for females and 85% for males by 2010. That means the share of births for other names outside of the top 1000 has risen. More parents are choosing different, more uncommon names to call their newborns.
Figure 4. Proportion of Top Births by Year and Sex |
Code:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 | # exploring increases in naming diversity # fewer parents choosing common names for children In [15]: table = top1000.pivot_table('prop', rows= ....: 'year',cols='sex', aggfunc=sum) In [17]: table.plot(title='Sum of table1000.prop by year and sex', \ ....: yticks=np.linspace(0,1.2,13), xticks=range(1880,2020,10)) Out[17]: <matplotlib.axes.AxesSubplot at 0x6a1ead0> # names proportion going down from 1 from top 1000 names In [20]: df = boys[boys.year==2010] In [21]: df Out[21]: <class 'pandas.core.frame.DataFrame'> MultiIndex: 1000 entries, (2010, M, 1676644) to (2010, M, 1677643) Data columns (total 5 columns): name 1000 non-null values sex 1000 non-null values births 1000 non-null values year 1000 non-null values prop 1000 non-null values dtypes: float64(1), int64(2), object(2) In [22]: prop_cumsum = df.sort_index(by='prop', ascending=False).prop.cumsum() In [23]: prop_cumsum[:10] Out[23]: year sex 2010 M 1676644 0.011523 1676645 0.020934 1676646 0.029959 1676647 0.038930 1676648 0.047817 1676649 0.056579 1676650 0.065155 1676651 0.073414 1676652 0.081528 1676653 0.089621 dtype: float64 In [24]: prop_cumsum.searchsorted(0.5) Out[24]: 116 # index 116, so 117 names in top 50% in 2010 In [25]: df1900 = boys[boys.year==1900] In [26]: prop1900 = df1900.sort_index(by='prop', ascending=False).prop.cumsum() In [27]: prop1900.searchsorted(0.5)+1 Out[27]: 25 # in 1900, top 50% of names covered with 25 names # so there is a large increase in name diversity In [28]: def get_quantile(group, q=0.5): ....: group = group.sort_index(by='prop', ascending=False) ....: return group.prop.cumsum().searchsorted(q)+1 ....: In [29]: diversity = top1000.groupby(['year','sex']).apply(get_quantile) In [30]: diversity = diversity.unstack('sex') In [31]: diversity.head() Out[31]: sex F M year 1880 38 14 1881 38 14 1882 38 15 1883 39 15 1884 39 16 In [32]: diversity.plot(title='Number of popular names in top 50%') Out[32]: <matplotlib.axes.AxesSubplot at 0x2075c1f0> |
This diversity increase can be said for female births as well. Rather than take the number of names in the top 50th percentile for 1880 and 2010, we calculate them for all the years, and both male and female names. Define a new function, 'get_quantile(group, q=0.5)', which sorts the 'group' argument by variable 'prop' in descending order, and returns the index of the sorted cumulative sum value at 0.5, adding 1 at the end to account for the index.
Pass this method through the top 1000 names grouped by year and sex into the 'diversity' DataFrame. Reconfigure the DataFrame table by placing 'sex' in the columns with '.unstack()' to finalize the table. Take a peak at the first 5 years of the 'diversity' data with '.head()', and observe that 38 female names and 14 male names accounted for 50% of the top births in 1880. Lastly we plot the 'diversity' DataFrame, shown below.
Figure 5. Popular Baby Names in 50% percentile |
We see a distinct increase in name diversity around 1985 for both males and females. Historically, female names were more diverse than male names. By 2010, the number of top female names accounting for the top 50 birth percentile more than doubled the male name counterpart.
Again, we reach the end of another lengthy, but I hope, enjoyable post in Python and Pandas concerning baby names. We explored and manipulated a dataset of 1.6 million rows, re-organized DataFrames, created new variables, and visualized various name metrics, all after accessing data split into 131 text files. There is more on baby names we will explore in Part B of this post. So stay tuned for more!
Thanks for reading,
Wayne
@beyondvalence
Python and Pandas Series:
1. Python and Pandas: Part 1: bit.ly and Time Zones
2. Python and Pandas: Part 2. Movie Ratings
3. Python and Pandas: Part 3. Baby Names, 1880-2010
4. Python and Pandas: Part 4. More Baby Names
.
www.bolavita.fun situs Judi Online Deposit via Go Pay !
ReplyDeleteTerbukti aman, dan sudah terpercaya, Minimal Deposit 50ribu ...
Tersedia Pasaran Lengkap seperti SBOBET - MAXBET - CBET
Informasi selengkapnya hubungi :
WA : +62812-2222-995
BBM : BOLAVITA
Keluaran Togel Kuala Lumpur Hari Ini terbaru 2019
Valence Analytics: Python And Pandas: Part 3. Baby Names, 1880-2010 >>>>> Download Now
ReplyDelete>>>>> Download Full
Valence Analytics: Python And Pandas: Part 3. Baby Names, 1880-2010 >>>>> Download LINK
>>>>> Download Now
Valence Analytics: Python And Pandas: Part 3. Baby Names, 1880-2010 >>>>> Download Full
>>>>> Download LINK
Valence Analytics: Python And Pandas: Part 3. Baby Names, 1880-2010 >>>>> Download Now
ReplyDelete>>>>> Download Full
Valence Analytics: Python And Pandas: Part 3. Baby Names, 1880-2010 >>>>> Download LINK
>>>>> Download Now
Valence Analytics: Python And Pandas: Part 3. Baby Names, 1880-2010 >>>>> Download Full
>>>>> Download LINK U0