
Monday, September 29, 2014

Handling Large Data with SQL Server


Hello Readers,

Big Data is all the rage these days. On this blog we have mainly used R as our analysis program of choice (sorry Python) to examine, model, and predict on the data. R is optimal for data with hundreds of thousands of rows or fewer, and larger data sets with millions of rows usually slow R down to a crawl. (Remember the neural network post?) Your computer setup is another deciding factor in computation speed.

Because big data sets contain millions of rows, we need only sample a few hundred thousand of them into R. But where do we store the millions of rows prior to sampling? Easy: in a data warehouse built for million-plus-row data sets, a SQL Server database.

We are not dealing with hundreds of millions or billions of rows here, so resist the urge to mention the Hadoop distributed file system. Or should I say, not yet? Anyway, here we will load million-plus-row U.S. Census data sets in preparation for sampling in R.


Getting Census Data


Specifically, we concentrate on two data sets from the 2011 U.S. Census American Community Survey (ACS), which contain nationwide population and housing data. You can download the data via FTP here, or choose the 2011 ACS 1-year PUMS from the ACS site. The ACS encompasses social, economic, demographic, and housing data. A screenshot of the page is shown below:


Figure 1. American Community Survey (Census) Webpage
After clicking the red highlighted link, 2011 ACS 1-year PUMS, we choose between CSV and SAS file formats. Take your pick; I chose CSV for importing into SQL Server.


Figure 2. Choose File Format - CSV
Next, the site lets us choose whether to download the entire data set for the U.S. or for individual states. Of course, download both the Population and Housing data for the entire U.S. Be warned: these two files take up nearly 1 GB of hard drive space, so be patient! Download accelerator, anyone?


Figure 3. Download Population and Housing Data Files
After the zip files have downloaded, unzip them. Note that both the housing data ('ss11hus*') and the population data ('ss11pus*') come in parts a and b.


Figure 4. Unzipped Files
You can also find the helpful README file for the data at the ACS site.
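
If you would rather script the download and extraction than click through the site, the short R sketch below shows the idea. The URL is a placeholder, not the real path; look up the actual file locations in the ACS FTP listing before running.

# Hypothetical sketch: download and unzip the population file in R.
# The URL is a placeholder -- find the real path on the ACS FTP site.
url <- "http://www2.census.gov/path/to/csv_pus.zip"
download.file(url, destfile = "csv_pus.zip", mode = "wb")  # ~0.5 GB, be patient
unzip("csv_pus.zip", exdir = "acs2011")
list.files("acs2011")  # expect parts a and b, e.g. ss11pusa.csv and ss11pusb.csv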


Importing into SQL Server


Now for the crucial part, where we import the data into a database system, SQL Server, instead of reading the CSV files into R directly. After connecting to your server, select the database into which you want to import the CSV files. Right-click it and, under Tasks, choose Import Files.

Figure 5. Import Files Option
Next, we specify the data source. The CSV files are not Excel or Access files; rather, they are a Flat File Source. Select the appropriate option, then choose a CSV file from the unzipped file location. On the left, under General, you can see a Columns view option, which lets you preview the file contents prior to importing.

Figure 6. Select Flat File Source
Figure 7 shows the Columns for our preview, so we can ensure the proper headers and values are read by the SQL Server Wizard.

Figure 7. Column Previews
Even though we right-clicked a specific database, SQL Server still asks which database we want to import the data into. You can change your mind here if you clicked the wrong one earlier, so take your pick.

Figure 8. Table Destination
Here we encounter the Select Source Tables window. We can alter the schema and name of the table with Edit Mappings at the bottom. The current schema and name are set to 'dbo' and 'ss11pusa', respectively. If you want a schema other than the default, now is the time to make the change.

Figure 9. Select Source Tables
SQL Server gives us one last overview before we start the import process. Ready? Click Finish.

Figure 10. Last Check
And here we go! Look at the Import Wizard chug along as it imports over a million rows!

Figure 11. Import Progressing
When the process finishes, we are rewarded with... a new table with lots of rows. Take a look below. 1.6 million? Not bad. Now you just have to import the remaining three files for a complete set!

Figure 12. Execution Successful
Remember to refresh the server instance so it reflects the new table. Now you can query away in a new query window to explore the data, or keep the server running and jump into R to establish a database connection using ODBC (open database connectivity), as sketched below.

Figure 13. Locating Imported Table
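
As a preview of that connection, here is a minimal RODBC sketch. The driver string, server name, and database name are assumptions, so substitute the details of your own instance; the table name dbo.ss11pusa matches the one we just imported.

library(RODBC)

# Connect to the local SQL Server instance (connection details assumed).
ch <- odbcDriverConnect("driver={SQL Server};server=localhost;database=Census;trusted_connection=true")

# Count the rows we just imported.
sqlQuery(ch, "SELECT COUNT(*) FROM dbo.ss11pusa")

# Pull roughly 1 percent of the rows into R instead of all 1.6 million.
pusa <- sqlQuery(ch, "SELECT * FROM dbo.ss11pusa TABLESAMPLE (1 PERCENT)")

close(ch)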
I thought this post made up for the previous few posts' lack of pictures. Were thirteen figures sufficient compensation? Here we learned how to import a large data set, separated into four tables, into a database management system, SQL Server. The next post will cover the next step in accessing big data from R: the database connection. There we shall use SQL Server as the data warehouse and R as the analytics tool. Stay tuned.

Thanks for reading,

Wayne
@beyondvalence
LinkedIn

Sunday, August 11, 2013

P1.7: Exploring Power View Reports in Excel

This post explores the Reports produced by Power View in Excel 2013. The data we will use is from the AdventureWorks dataset, which is in the Chapter 1 workbook from the Alberto Ferrari PowerPivot book.


Creating a Power View Report

Similar to the Dashboard we created in P1.5, Power View in Excel can construct a report that conveys data to the reader, but with more visualization capabilities. Power View reports require minimal mouse clicks to embellish and offer Excel users a graphical data exploration tool. PivotTables let Excel users explore data using tables, whereas Power View lets us explore the data graphically in a simple charting environment.

Power View works through the Data Model and not Excel tables, so if you add tables to the Power View report, they will be added to the Data Model. Make sure you have MS Silverlight installed for Power View to work.


Click on the Power View button on the Insert tab in Excel. We will create a new Power View report, so click OK when Create a Power View sheet is selected.



Fig. 1: Power View button
Now we have a Power View report to which we can add fields. Start by clicking on the report and then selecting the ProductCategory and SalesAmount fields from the Sales table in the Power View Fields pane on the right. You can also add an appropriate title to the top of the report; I will name this one "Sales by Category and Geography".


Fig. 2: Product Category and Sales in Power View Report
Now suppose you would like to show more information, such as sales by geographical region. To start another table, click on an empty area in the Power View report. If you add more fields while the previous table is selected, they will be added to that table. Select the TerritoryCountry and SalesAmount fields in the Sales table. This will display a table showing each country and its sales.


Fig. 3: Country and Sales table in Power View Report
Looking at the Sales by Country in a tabular format is not exactly visually appealing. However, we can portray the geographic data on a map, which will enhance the analysis visually. An internet connection is required to map the data. With the TerritoryCountry table selected, click on the Map button in the Design tab, and the table is converted into a map chart showing SalesAmount as circles over the respective country.


Fig. 4: Map Chart of TerritoryCountry and SalesAmount
We can re-size and zoom in or out of the map to display the area we would like to exhibit on the report. The product category table can also be converted into a visually appealing format as a chart by selecting the Column Chart button in the Design tab.


Fig. 5: Column Chart of ProductCategory and SalesAmount

Adding Filters to the Power View Report

While creating a report in Power View, we can also add filters to aid in analyzing the data. Let us include the Office and SalesManager fields as filters on the report to view the stratified data graphically. Start by adding Office and SalesManager as two tables on the report above.


Fig. 6: Office and SalesManager Tables on the Sales Report
Now that we have the two tables added, we can change them to what Excel calls a Slicer. With the desired table selected, click on the Slicer button in the Design tab.


Fig. 7: Office Table as Slicer in the Sales Report
In Figure 7, the Office table is now a Slicer, or filter, and the SalesManager table is selected, ready to be converted into a Slicer as well via the highlighted Slicer button. With both fields as Slicers, we can stratify by Office and SalesManager so the Sales Report shows only the data for a specific selection. For example, we can select the Seattle office and Alberto Ferrari as the sales manager. The Power View report would then reflect only the sales made by the Seattle office under the sales manager Alberto Ferrari.

Fig. 8: Sales in the Seattle Office with Alberto Ferrari as Sales Manager
Filters and slicers allow us to view the data under different field selections. In the Power View Report, the chart is updated automatically with the selected filter, which enables fast and efficient data analysis. We can clear the filters by clicking on the Eraser button on the top right-hand corner of the filter table, shown in Figure 9 below.

Fig. 9: Clear Filter Button

To conclude, Excel charts allow a great deal of customization, with many options to choose from. In contrast, Power View charts let the user visualize the data and spot patterns with minimal creation effort.


Thanks for reading,
Wayne

Tuesday, August 6, 2013

P1.6: Notes on PivotTable Conversion to Formulas and Row-Column Oriented Databases

This post touches on what happens when a PivotTable is converted to formulas in Excel, and on how PowerPivot uses a column-oriented database. (Notes for P1.5.)


When Converting to Formulas

When you convert a PivotTable to an Excel table using the Convert to Formulas button, Excel replaces the cells with formulas which reference the Data Model.


Fig. 1: Convert to Formulas Button
For example, once converted, the Internet sales table has labels derived from the CUBEMEMBER formula and table values derived from CUBEVALUE.


CUBEMEMBER Function
Fig. 2: Sample Column Label Formula, CUBEMEMBER
Row and column labels are retrieved from the Data Model, or Cube, by the CUBEMEMBER formula. The function accesses the Data Model in this workbook, generically called ThisWorkbookDataModel. To return the specific year 2006, we need the DimTime table, which holds the FiscalYear. So the column formula for the year 2006 (Figure 2) is:
=CUBEMEMBER("ThisWorkbookDataModel","[DimTime].[FiscalYear].&[2006]")

CUBEVALUE Function

Fig. 3: Sample Table Value Formula, CUBEVALUE
The values in the table are retrieved through the Data Model, which is also a Cube. As with CUBEMEMBER, this CUBE function accesses the local Data Model in the workbook, ThisWorkbookDataModel. The measure it retrieves is referenced by $B$3, the sum of SalesAmount. It narrows the result to the Europe territory group, referenced in $B5, and the year 2006, referenced in cell D$4. The CUBEVALUE function takes a set of members as coordinates (hence CUBEMEMBER for the row and column labels), which is how you navigate OLAP Cubes. The table value formula for the sum of Internet sales in Europe in 2006 (Figure 3) is:
=CUBEVALUE("ThisWorkbookDataModel",$B$3,$B5,D$4)

MDX

Excel uses MDX to communicate with PowerPivot and to reference the data stored there. MDX, like the CUBE functions above, navigates OLAP Cubes by providing a set of member coordinates. In fact, PowerPivot stores the data as Cubes, which are processed automatically.


Row and Column Oriented Databases

Row-oriented databases, such as SQL Server, store data row by row: each entry is kept together with all of its field values. Generally, a row-oriented database has to scan every row to answer a query over an entire column of values, such as the summation of Sales. A complete table scan is required whether you query one column or twenty, and for tables with many entries this can be time-consuming and inefficient.

On the other hand, column-oriented databases, such as the xVelocity analytics engine in PowerPivot, store data by column. If you query Sales in PowerPivot, the engine only needs to read the Sales column. For data retrieval, a column-oriented database is vastly more efficient because when querying a large data set you usually analyze only a few fields at a time, mostly for aggregate values. Reading a few columns is far more economical than scanning every row for those same few fields, as a row-oriented database must.

The most efficient query for a row-oriented database is a single entry row, since all the data for that row is stored together; the same query is the most inefficient case for a column-oriented database, which must touch every column to reassemble the one row. Conversely, the most efficient query for a column-oriented database is a scan of one column, which is the most inefficient case for a row-oriented database, which must read every row to extract a single field.
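
Because R itself stores a data frame column by column, a small illustrative sketch (with made-up data) shows the same trade-off:

# Each data.frame column is its own vector, much like a column store.
n <- 5e6
df <- data.frame(sales = runif(n), qty = sample(1:10, n, replace = TRUE))

# Aggregating one field only scans that column's vector: cheap.
system.time(sum(df$sales))

# Fetching one "entity" must visit every column to rebuild the row.
system.time(df[1234567, ])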

Thanks for reading,
Wayne