Valence Analytics: import data

Showing posts with label import data. Show all posts

Monday, September 29, 2014

Handling Large Data with SQL Server

Hello Readers,

Big Data is all the rage these days. On this blog we have mainly used R as our analysis program of choice (sorry Python) to examine, model, and predict on the data. R is optimal for data with hundred thousands of rows or less, and dealing with larger data sets with millions of rows or more usually slows R down to a crawl. (Remember the neural network post?) Another deciding factor on computation speed is your computer setup.

Because there are millions of rows in big data sets, we need only sample a few hundred thousand of them into R. But where will we store the millions of rows prior to sampling? Easy, into a data warehouse made for million plus row data sets, a SQL Server database.

We have not dealt with hundreds of millions or billions of rows, so resist the urge to mention the Hadoop distributed file system. Or should I say not yet? Anyways, here we will load million plus row U.S. Census data sets in preparation for R sampling.

Getting Census Data

Specifically we concentrate on two data sets, U.S. Census American Community Survey 2011, where we can find nationwide population and housing data. You can download the data via ftp here, or choose 2011 ACS 1-year PUMS from the ACS site. The ACS encompasses social, economic, demographic, and housing data. A screenshot of the page is shown below:

Figure 1. American Community Survey (Census) Webpage

After clicking the red highlighted link, 2011 ACS 1-year PUMS, we choose from CSV or SAS file format. Take your pick, I chose CSV for importing into SQL Server.

Figure 2. Choose File Format - CSV

Next, the site gives us choices whether to download the entire dataset for the U.S. or individual states. Of course, download both the Population and Housing data for the entire U.S. Be warned, these two files take up nearly 1 GB of hard drive space, so be patient! Download accelerator anyone?

Figure 3. Download Population and Housing Data Files

After we have the zip files downloaded, unzip them. Note that the housing data 'ss11hus*' and population data 'ss11pus*' have parts a and b.

Figure 4. Unzipped Files

You can also find the helpful README file for the data at the ACS site as well.

Importing into SQL Server

Now for the crucial part, where we import the data into a database system, SQL Server, instead of reading the CSV files into R directly. After connecting to your server, select which database where you would import the CSV files. Right-click and under Tasks, > choose Import Files.

Figure 5. Import Files Option

Next we specify the data source. The CSV files are not Excel or Access files, rather they are a Flat File Source. So select the appropriate option, and select a CSV file from the unzipped file lcoation. On the left you can see a Columns view option under General. You can preview the file contents prior to importing it.

Figure 6. Select Flat File Source

Figure 7 shows the Columns for our preview, so we can ensure the proper headers and values are read by the SQL Server Wizard.

Figure 7. Column Previews

Even though we right-clicked a certain database, SQL Server still asks us in which database we want to import the data. You can change your mind, if you clicked wrong earlier, so take your pick.

Figure 8. Table Destination

Here we encounter the Source Table Option window. We can alter the schema and name of table with Edit Mapping at the bottom. The current schema and name are set as 'dbo' and 'ss11pusa', respectively. If you want to use a different schema than the default, now is the time to impose the change.

Figure 9. Select Source Tables

SQL Server gives us a last overview option before we start the importing process. Ready? Click Finish.

Figure 10. Last Check

And here we go! Look at the Import Wizard chug along as it imports over a million rows!

Figure 11. Import Progressing

When the process is finished, we are rewarded with... a new table lots of rows. Take a look below. 1.6 million? Not bad. Now you just have to import the remaining 3 for a complete set!

Figure 12. Execution Successful

Remember to refresh the Server instance for it to reflect the new table changes. Now you can query away in a new query window to explore the data, or you can keep the Server on and jump into R to establish a database connection using ODBC (open database connectivity).

Figure 13. Locating Imported Table

I thought this post made up for the previous few posts' lack of pictures. Was the compensation count this time sufficient (13)? Here we learned how import a large data set, separated into 4 tables, into a database management system, SQL Server. The next post will feature the next step in accessing Big Data in R, the database connection. There we shall use SQL Server as a data warehouse and R as the analytics tool. Stay tuned.

Thanks for reading,

Wayne
@beyondvalence
LinkedIn

Saturday, September 13, 2014

Python and Pandas: Part 3. Baby Names, 1880-2010

Hello Readers,

Here in the third part of the Python and Pandas series, we analyze over 1.6 million baby name records from the United States Social Security Administration from 1880 to 2010. A particular name must have at least 5 occurrences for inclusion into the data set. We will explore reading in multiple raw data files, merging them into one DataFrame, subsetting desired portions of the data, creating new variable metrics, and visualizing results.

As usual, start IPython in your command prompt if you want to follow along. You can find the data here, under National data (it unzips to 'names' folder). Let's jump in.

Data Preview

The baby name files are split by year of birth, all in a similar format: 'yob1880.txt', 'yob1881.txt', and so on to 'yob2010.txt'. You can go ahead and import 'pandas', 'pylab', and 'numpy' modules now or when they required later.

Use the '.read_csv()' method to access the first text file of baby names in 1880. We see that there were 2,000 boy and girl names from the data that year (n>=5), with 3 variables: the name, the sex of the baby, and the birth count for that name.

Code:

C:\Users\wayne>cd .\Documents\python\dataAnalysis\git\ch02

C:\Users\wayne\Documents\python\dataAnalysis\git\ch02>ipython --matplotlib
Python 2.7.8 (default, Jun 30 2014, 16:03:49) [MSC v.1500 32 bit (Intel)]
Type "copyright", "credits" or "license" for more information.

IPython 2.1.0 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.
Using matplotlib backend: Qt4Agg

In [1]: import pandas as pd

In [2]: names1880 = pd.read_csv('names\yob1880.txt',names=['name','sex','births'])

In [3]: names1880
Out[3]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2000 entries, 0 to 1999
Data columns (total 3 columns):
name      2000  non-null values
sex       2000  non-null values
births    2000  non-null values
dtypes: int64(1), object(2)

In [4]: names1880.groupby('sex').births.sum()
Out[4]:
sex
F       90993
M      110493
Name: births, dtype: int64

Performing a quick tab, we group the data by 'sex' and view the count of 'births'. There are 90,993 girls, and 110,493 boys in the 1880 data.

Assembling the Data

Now that we have an idea of the data contents, and we know the pattern of the text file names, we can create a loop to read in the data. At the same time, we add another variable denoting the year for a particular name entry for when all the years are together.

Create a 'years' variable which we will use to iterate through each year text file. Then we read the data, add a 'year' column, append the data to DataFrame 'pieces', then merge them together. Using the '%d' string formatter, we can replace that space with a given variable, 'year'. After using the '.append()' method to add the current 'frame' object to 'pieces', we take advantage of '.concat()' to merge the frames in 'pieces' by row for a completed DataFrame in 'names'. 'ignore_index' should be True because we do not want to keep the original indexes.

Code:

# data split by year
# so assemble data into single DataFrame and add year field
# use %d string formatter to iterate through the years
# add 'year' column
# append pieces together
# .concat merges by row and do not preserve original row numbers

In [9]: pieces = []

In [10]: columns = ['name', 'sex', 'births']

In [11]: years = range(1880, 2011)

In [12]: for year in years:
   ....:     path = 'names/yob%d.txt' % year
   ....:     frame = pd.read_csv(path, names=columns)
   ....:     frame['year'] = year
   ....:     pieces.append(frame)
   ....:     names = pd.concat(pieces, ignore_index=True)
   ....:

In [13]: names
Out[13]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1690784 entries, 0 to 1690783
Data columns (total 4 columns):
name      1690784  non-null values
sex       1690784  non-null values
births    1690784  non-null values
year      1690784  non-null values
dtypes: int64(2), object(2)

# 1,690,783 rows of data with 4 columns

In [14]: names.save('names.pkl')

In the 'names' DataFrame, we have 1,690,783 names from years 1880 to 2010 with 4 columns, including the year. Remember to pickle the DataFrame with '.save()', in other words, save it. It is a hefty file, around 63 MB in size, but Python will do all the heavy lifting!

Exploring the Data

First off, a pivot table is in order. Let's move the 'sex' to the columns, and the 'year' in the rows, while positioning the 'births' in values. Calling '.tail()' will give us the last 5 rows in the table. To get a bigger picture, plot the table of births stratified by sex and year.

Code:

In [17]: total_births = names.pivot_table('births', rows='year', cols='sex', aggfunc=sum)

In [19]: total_births.tail()
Out[19]:
sex         F        M
year
2006  1896468  2050234
2007  1916888  2069242
2008  1883645  2032310
2009  1827643  1973359
2010  1759010  1898382

In [20]: total_births.plot(title='Total Births by sex and year')
Out[20]: <matplotlib.axes.AxesSubplot at 0x1a5f4730>

Figure 1. Total births by sex and year

We can observe the birth trends rise and fall based on economic trends- they births tend to fall in times of recession, and male births started to outpace female births after WWII.

Adding Proportion and Subsetting Top 1000 Names

Here we add the column of proportions to our 'names' DataFrame. The proportions will be the number of births out of each total births grouped by year and sex. We define a new method, 'add_prop()' and convert the argument value to a float type for non-integer division purposes. Then we divide the births by the sum of births in the grouping, and return the number.

Code:

# group by year and sex
# add proportion of babies given certain name relative to number of births

In [23]: def add_prop(group):
   ....:     # integer division floors
   ....:     births = group.births.astype(float)
   ....:     group['prop'] = births / births.sum()
   ....:     return group
   ....:

In [24]: names = names.groupby(['year','sex']).apply(add_prop)

In [25]: names
Out[25]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1690784 entries, 0 to 1690783
Data columns (total 5 columns):
name      1690784  non-null values
sex       1690784  non-null values
births    1690784  non-null values
year      1690784  non-null values
prop      1690784  non-null values
dtypes: float64(1), int64(2), object(2)

# check to see values by group sum to 1

In [28]: import numpy as np

In [29]: np.allclose(names.groupby(['year','sex']).prop.sum(),1)
Out[29]: True

# subset top 1000 births

In [30]: def get_top1000(group):
   ....:     return group.sort_index(by='births', ascending=False)[:1000]
   ....:

In [31]: grouped = names.groupby(['year','sex'])

In [32]: top1000 = grouped.apply(get_top1000)

In [37]: top1000
Out[37]:
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 261877 entries, (1880, F, 0) to (2010, M, 1677643)
Data columns (total 5 columns):
name      261877  non-null values
sex       261877  non-null values
births    261877  non-null values
year      261877  non-null values
prop      261877  non-null values
dtypes: float64(1), int64(2), object(2)

We pass the groups by year and sex to the 'add_prop()' method using '.apply()'. Confirming the new 'prop' column, the new 'names' DataFrame now has 5 columns. To ensure the birth proportions by groups are accurate, we verify using '.allclose()' method in the 'numpy' module, and compare the sum to 1. Python returns 'True', and we are assured the column values are correct.

With this new DataFrame, we now will subset the top 1000 names by birth in each year and sex grouping. Define a new method, 'get_top1000()', and which sorts the births in descending order, and returns the first 1000 entries. We pass the 'names' DataFrame grouped by year and sex to the 'get_top1000()' method into our new DataFrame, 'top1000'. Instead of over 1.6 million entries, we now have 261,877 entries with which to work.

Some Naming Trends

Because the data include information spanning from 1880 to 2010, we can examine trends in time of baby names for any changes. As usual, remember to save and pickle the 'top1000' DataFrame as we go along the analysis. Begin by separating sex into two different DataFrames for later use.

Now create a pivot table from 'top1000', with births as summed values, years in rows, and names in the columns. There are 131 rows, one for each year and 6,865 columns, or names. We will subset by column, take only specific names, and plot the births for the selected names by year in a single plot. You can choose different names, and I chose John, Harry, Mary, and Marilyn as sample names.

Code:

# analyze naming trends

In [38]: top1000.save('top1000.pkl')

In [39]: boys = top1000[top1000.sex == 'M']

In [40]: girls = top1000[top1000.sex == 'F']

In [41]: total_births = top1000.pivot_table('births', rows='year', cols='name', aggfunc=sum)

In [42]: total_births
Out[42]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 131 entries, 1880 to 2010
Columns: 6865 entries, Aaden to Zuri
dtypes: float64(6865)

In [43]: subset = total_births[['John','Harry','Mary','Marilyn']]

In [46]: subset.plot(subplots=True, figsize=(12,10),grid=False,title='Number of births per year
')
Out[46]:
array([<matplotlib.axes.AxesSubplot object at 0x1A7144D0>,
       <matplotlib.axes.AxesSubplot object at 0x14C13E30>,
       <matplotlib.axes.AxesSubplot object at 0x1AA527F0>,
       <matplotlib.axes.AxesSubplot object at 0x172A05B0>], dtype=object)

# plot 'Wayne'
    
In [47]: subsetw = total_births[['Wayne']]

In [48]: subsetw.plot(title='Wayne')
Out[48]: <matplotlib.axes.AxesSubplot at 0x14c0f2f0>

# names growing out of favor?

Figure 2. Number of Births per Year, Selected Names

For John, Harry, and Mary, they have bimodal peaks about 1920s and 1950s. For Marilyn, the name became steadily popular from the 1930's to the late 1950's. For all four names, we observe a fall in births per year. Are those names really becoming more uncommon? We will discover what is happening in the data below.

Curious, I plotted my name to see the birth time trends of 'Wayne'. It follows the same rise, peak, and fall around the 1950's, though it followed less of a bimodal distribution.

Figure 3. Number of Births per Year, For Name: Wayne

Baby Name Diversity

First, a spoiler: the drop in births for certain names have something to do with the name diversity- what parents choose to name their child. The trend changes from 1950's onwards. To examine this change, we turn to the variable we created earlier, the proportion of births in each group by year and sex. So we create a pivot table from the 'top1000', but this time with the sum values as 'prop', 'year' as rows, and 'sex' as columns.

This will allow us to plot Figure 4. Note how proportion total starts at 1.0 in 1880, and slowly drops in 1960 for females and in 1970 for males. The decline in proportion of births accounted by the top 1000 names has declined to around 74% for females and 85% for males by 2010. That means the share of births for other names outside of the top 1000 has risen. More parents are choosing different, more uncommon names to call their newborns.

Figure 4. Proportion of Top Births by Year and Sex

We can check this by examining the boys and girls DataFrames we created earlier. We subset the year 2010, sort by proportion in descending order, then take the cumulative sum of the proportions of births. Taking the first 10 names, we see that the top name were roughly 1.15% of the total male births in 2010. Using '.searchsorted(0.5)' to find the sorted index of the 50th percentile, Python returns 116. Therefore 117 names consist of 50% of the male births in 2010. We compare this number to the 50% percentile of male births in 1880, which is 25. From 1880 to 2010, the number of names in the top 50% percentile of male births increased over 350% from 25 to 117. Male name diversity sure increased over the years.

Code:

# exploring increases in naming diversity
# fewer parents choosing common names for children

In [15]: table = top1000.pivot_table('prop', rows=
   ....: 'year',cols='sex', aggfunc=sum)

In [17]: table.plot(title='Sum of table1000.prop by year and sex', \
   ....: yticks=np.linspace(0,1.2,13), xticks=range(1880,2020,10))
Out[17]: <matplotlib.axes.AxesSubplot at 0x6a1ead0>

# names proportion going down from 1 from top 1000 names

In [20]: df = boys[boys.year==2010]

In [21]: df
Out[21]:
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 1000 entries, (2010, M, 1676644) to (2010, M, 1677643)
Data columns (total 5 columns):
name      1000  non-null values
sex       1000  non-null values
births    1000  non-null values
year      1000  non-null values
prop      1000  non-null values
dtypes: float64(1), int64(2), object(2)

In [22]: prop_cumsum = df.sort_index(by='prop', ascending=False).prop.cumsum()

In [23]: prop_cumsum[:10]
Out[23]:
year  sex
2010  M    1676644    0.011523
           1676645    0.020934
           1676646    0.029959
           1676647    0.038930
           1676648    0.047817
           1676649    0.056579
           1676650    0.065155
           1676651    0.073414
           1676652    0.081528
           1676653    0.089621
dtype: float64

In [24]: prop_cumsum.searchsorted(0.5)
Out[24]: 116

# index 116, so 117 names in top 50% in 2010

In [25]: df1900 = boys[boys.year==1900]

In [26]: prop1900 = df1900.sort_index(by='prop', ascending=False).prop.cumsum()

In [27]: prop1900.searchsorted(0.5)+1
Out[27]: 25

# in 1900, top 50% of names covered with 25 names
# so there is a large increase in name diversity

In [28]: def get_quantile(group, q=0.5):
   ....:     group = group.sort_index(by='prop', ascending=False)
   ....:     return group.prop.cumsum().searchsorted(q)+1
   ....:

In [29]: diversity = top1000.groupby(['year','sex']).apply(get_quantile)

In [30]: diversity = diversity.unstack('sex')

In [31]: diversity.head()
Out[31]:
sex    F   M
year
1880  38  14
1881  38  14
1882  38  15
1883  39  15
1884  39  16

In [32]: diversity.plot(title='Number of popular names in top 50%')
Out[32]: <matplotlib.axes.AxesSubplot at 0x2075c1f0>

This diversity increase can be said for female births as well. Rather than take the number of names in the top 50th percentile for 1880 and 2010, we calculate them for all the years, and both male and female names. Define a new function, 'get_quantile(group, q=0.5)', which sorts the 'group' argument by variable 'prop' in descending order, and returns the index of the sorted cumulative sum value at 0.5, adding 1 at the end to account for the index.

Pass this method through the top 1000 names grouped by year and sex into the 'diversity' DataFrame. Reconfigure the DataFrame table by placing 'sex' in the columns with '.unstack()' to finalize the table. Take a peak at the first 5 years of the 'diversity' data with '.head()', and observe that 38 female names and 14 male names accounted for 50% of the top births in 1880. Lastly we plot the 'diversity' DataFrame, shown below.

Figure 5. Popular Baby Names in 50% percentile

We see a distinct increase in name diversity around 1985 for both males and females. Historically, female names were more diverse than male names. By 2010, the number of top female names accounting for the top 50 birth percentile more than doubled the male name counterpart.

Again, we reach the end of another lengthy, but I hope, enjoyable post in Python and Pandas concerning baby names. We explored and manipulated a dataset of 1.6 million rows, re-organized DataFrames, created new variables, and visualized various name metrics, all after accessing data split into 131 text files. There is more on baby names we will explore in Part B of this post. So stay tuned for more!

Thanks for reading,

Wayne
@beyondvalence
LinkedIn

Python and Pandas Series:
1. Python and Pandas: Part 1: bit.ly and Time Zones
2. Python and Pandas: Part 2. Movie Ratings
3. Python and Pandas: Part 3. Baby Names, 1880-2010
4. Python and Pandas: Part 4. More Baby Names
.

Tuesday, December 10, 2013

Importing Data into a Database Engine, SQL Server 2012

Hello Readers!

Welcome back. We need data to be able to query tables. So today we will walk through how to import data into a Database Engine in SQL Server Management Studio 2012. Click here for more info on SSMS 2012.

Let us get started!

The Setup

First, open SQL Server Management Studio (SSMS). We are greeted with a dialogue box to connect to a server. I will use Windows Authentication to connect to my local server.

Connect to a Server

After we have connected, we see the Object Explorer to the left of the screen. The Server will show a green arrow to indicate a successful connection. The Object Explorer allows us to view and manipulate the different utilities, tools, and capabilities in SSMS. The expanded explorer box is shown below. Note the green 'good to go' arrow by the server name.

Connection Established

Importing the Data

Now we need to import the data. It would be helpful if we had the target data already on the computer, and in this case, we will be importing the familiar AdventureWorks database which exists as an Access database file. We can also import it as an mdf (mirror disk file) from this link.

Start by right clicking the database to where the data will be imported. Now from Tasks, click Import Data... towards the bottom.

Import Data Button

In step, we are in the SQL Server Import and Export Wizard. As shown below. Select the appropriate data source and locate the file on the hard drive:

Select the Data Source and Location

After choosing the input specifications, we need to choose the output Destination. I will import the data into the POWER (short for PowerPivot) database.

Choosing the Destination

And then for the nitty-gritty of specifying which tables to copy. We have two options: manual selection or SQL query. Select either, I will select manual because I will be copying all the data.

Specify Which Selection Method

After which, we can check the boxes of the tables we require. Clicking the box at the top left will select all of the tables. Click Next.

Selecting Tables

We are then given a review of the selected tables before they are imported into the database. The table attributes and types are shown to verify the correct tables have been selected, below.

Data Type Review

Clicking Next and Finish, we now have the tables in the POWER database. The Object Explorer below reflects the newly imported tables.

Object Explorer with New Tables

And with the new query window open, we can now query tables that we require!

Blank Query Window

That concludes this post on how to import data into a database in SQL Server Management Studio 2012. Future posts will include SQL querying and use of the SSMS Analysis Services to analyze the data. Also, I will include a post on using R to connect to SQL Server to retrieve tables. Please look forward to the new posts!

Thanks for reading!

Wayne

Pages