
Saturday, April 18, 2015

Predicting Fraudulent Transactions in R: Part 5. Normalized Distance to Typical Price


Hello Readers,

Here we delve into a quick evaluation of quality metrics for the rankings of unlabeled reports. Previously we relied on labeled reports for the true fraud/non-fraud status during evaluation, but once we rank all of the reports, including the training set, there will undoubtedly be unlabeled reports ranked near the top. Should those unlabeled reports really be near the top in terms of likelihood of being fraudulent? That is where we use our predictive model to determine the classification of those unlabeled reports based on the labeled reports.

One method compares the unit price of a report with the typical unit price of transactions of that particular product. If the difference between the unit price of the transaction and the typical unit price for that product is large, then that report likely belongs among the top possible fraudulent transactions predicted by the model. This gives us a method for evaluating model performance.

(This is a series from Luis Torgo's Data Mining with R book.)


Normalized Distance to Typical Price


To calculate the normalized distance to typical price (NDTP) for the product of a given transaction (report), we take the difference between the unit price of that transaction and the overall median unit price of that product, and divide it by the interquartile range (IQR) of that product's unit prices.
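In symbols, for a transaction with unit price $u$ of product $p$, the NDTP can be written as

$$\mathrm{NDTP}_p(u) = \frac{|u - \widetilde{U}_p|}{\mathrm{IQR}_p}$$

where $\widetilde{U}_p$ is the median unit price of product $p$ and $\mathrm{IQR}_p$ is the interquartile range of its unit prices, both computed from the non-fraud transactions.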

Below is the code for calculating the NDTP. Starting on line 2, we define the 'avgNDTP' function to accept the arguments 'toInsp', 'train', and 'stats'. 'toInsp' contains the transaction data we want to inspect; 'train' contains raw transactions from which the typical unit prices are obtained; 'stats' contains pre-calculated typical unit prices for each product. This way we can call the 'avgNDTP' function repeatedly; supplying the pre-calculated stats is more efficient than recalculating them on every call.

Normalized Distance to Typical Price Code:

 1 > # normalized distances of typical prices ####
 2 > avgNDTP <- function( toInsp, train, stats ) {
 3 +   if(missing(train) && missing(stats))
 4 +     stop('Provide either the training data or the product stats')
 5 +   if(missing(stats)) {
 6 +     notF <- which(train$Insp != 'fraud')
 7 +     stats <- tapply(train$Uprice[notF],
 8 +                     list(Prod=train$Prod[notF]),
 9 +                     function(x) {
10 +                       bp <- boxplot.stats(x)$stats
11 +                       c(median=bp[3],iqr=bp[4]-bp[2])
12 +                     })
13 +     stats <- matrix(unlist(stats),
14 +                     length(stats), 2, byrow=T,
15 +                     dimnames=list(names(stats), c('median','iqr')))
16 +     stats[which(stats[,'iqr']==0),'iqr'] <-
17 +       stats[which(stats[,'iqr']==0),'median']
18 +   }
19 +   
20 +   mdtp <- mean(abs(toInsp$Uprice-stats[toInsp$Prod,'median']) /
21 +                  stats[toInsp$Prod,'iqr'])
22 +   return(mdtp)
23 + }

The if statements check that sufficient arguments are present, and whether the 'stats' input needs to be calculated. If so, line 6 begins calculating the typical unit price for each product from the inspected transactions that are not labeled fraudulent. From there, we compute the median and IQR starting on line 9. After unlisting the results and transforming 'stats' into a matrix on line 13, we replace any IQR of zero with the corresponding median value, because we cannot divide by zero.


On line 20, we create the variable 'mdtp' to hold the mean normalized distance to the typical price, using the unit prices from 'toInsp' and the medians and IQRs either from the 'stats' provided or from the 'stats' generated with the 'train' argument.
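To see the function in action, here is a hedged usage sketch with a tiny made-up data frame standing in for the book's sales data (the values below are invented purely for illustration):

# toy data frame with the columns avgNDTP() expects
toy <- data.frame(Prod   = factor(c("p1","p1","p1","p2","p2","p2","p1")),
                  Uprice = c(10, 11, 12, 5, 6, 7, 50),
                  Insp   = factor(c("ok","ok","unkn","ok","unkn","ok","unkn")))
trainToy  <- toy[1:6, ]               # transactions used for the typical prices
toInspToy <- toy[7, , drop = FALSE]   # one suspiciously priced report to score
avgNDTP(toInspToy, train = trainToy)  # large value: far from p1's typical unit price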

We will use this NDTP metric to evaluate and compare upcoming predictive models, and to see how well they identify fraudulent transactions given the few inspected transactions we have in our dataset.


So stay tuned for the upcoming predictive model posts!



As always, thanks for reading,

Wayne
@beyondvalence
LinkedIn

Fraudulent Transactions Series:
1. Predicting Fraudulent Transactions in R: Part 1. Transactions
2. Predicting Fraudulent Transactions in R: Part 2. Handling Missing Data
3. Predicting Fraudulent Transactions in R: Part 3. Handling Transaction Outliers
4. Predicting Fraudulent Transactions in R: Part 4. Model Criterion, Precision & Recall
5. Predicting Fraudulent Transactions in R: Part 5. Normalized Distance to Typical Price

Saturday, March 28, 2015

Predicting Fraudulent Transactions in R: Part 4. Model Criterion, Precision & Recall


Hello Readers,

Before we start creating classification models to predict fraudulent transactions, we need to understand how to evaluate model performance. Not only do we want to know whether a particular transaction is predicted as fraudulent, we also want to know the confidence in that prediction (such as a probability of fraud).

To do this, we will use two measures, precision and recall. Beginning where we left off at the end of Part 3, start R and let us get started.

(This is a series from Luis Torgo's Data Mining with R book.)


Precision & Recall


Because we have a limited number of transactions we can inspect, k transactions out of 401146 total transactions, we can determine the precision and recall of the model on those k transactions. Those k top transactions contain the most likely fraud candidates predicted by the model.

Precision:
Within those k transactions, the proportion of frauds among them; also known as positive predictive value

Recall:
Among all the fraudulent transactions, the proportion captured within the k transactions; also known as sensitivity
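As a quick numerical sketch of these two measures (the counts here are made up, not taken from the report data):

# hypothetical counts, purely for illustration
k            <- 100   # number of top-ranked reports we can inspect
tp           <- 40    # true frauds found among those k reports
total_frauds <- 50    # total frauds present in the data
precision <- tp / k             # 0.4: proportion of the k inspected reports that are frauds
recall    <- tp / total_frauds  # 0.8: proportion of all frauds that we captured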

There are trade-offs when deciding on the desired levels of recall and precision. It would be easy to choose a large k to capture all of the fraudulent transactions, but that would result in low precision, as there would be a larger proportion of 'good' transactions among the top k. Given the feasibility of inspecting only k transactions, we want to make the most of our limited inspection resources. So if we inspect x transactions in t hours and manage to capture all the frauds within those x transactions, we will have accomplished our task; even if a large proportion of those x transactions turn out to be valid, we would still value the high recall in this situation.



Performance Models


We now turn to evaluating different inspection effort levels using visual graphics. Some statistical models might be better suited towards precision, others towards recall. We rank the cases by their likelihood of belonging to the class of interest (fraud) and compute the precision and recall values at different effort limits. Different effort limits yield different precision and recall values, creating a visual performance curve from which we can find the effort limits giving the desired balance of precision and recall.


We use the "ROCR" package, which contains multiple functions for evaluating binary classifiers (yes/no, fraud/no-fraud variables), including functions that calculate precision and recall, which we then use to plot the performance model. Below, we load the "ROCR" library and the "ROCR.simple" example data set, which comes with predictions and true labels. Next, in line 4, we create "pred" with the "prediction( )" function, using the "$predictions" and the true values in "$labels". Afterwards, we pass the "pred" prediction object to "performance( )" to obtain the precision and recall metrics; in line 5, we specify "prec" and "rec" (precision and recall, respectively) as arguments.

R Code:

1 > # precision and recall functions
2 > library(ROCR)
3 > data(ROCR.simple)
4 > pred <- prediction(ROCR.simple$predictions, ROCR.simple$labels)
5 > perf <- performance(pred, "prec", "rec")
6 > plot(perf)
7 > 

Plotting the "perf" object in line 6 will plot the below 'sawtooth' graphic.

Figure 1. Sawtooth Precision and Recall

As you can observe, at high levels of precision the recall is not always high, and vice versa. Note the axis limits: (0.5, 1.0) for precision on the Y axis, and (0.0, 1.0) for recall on the X axis. This means that at 100% recall, where all the positives are captured, the precision (the proportion of true positives among the flagged cases) falls below 0.5, due to capturing non-positives as well. At 100% precision, on the other hand, the recall falls to near zero, as only a few of the total positives are captured. However, note the top right point, where precision and recall both reach above 0.80; that is a good trade-off point if we take both precision and recall into account.

Evidently, the 'sawtooth' graph is not smooth at all. A common way to smooth it is to use the highest precision achieved at or beyond each recall level. This interpolation assigns to each recall value the maximum precision observed at that recall or at any greater recall, so that as recall increases, the interpolated precision can only stay level or decrease. This is described by the formula below, where r is the recall value and r' ranges over the recall values greater than or equal to r:


$$\mathrm{Prec}_{int}(r) = \max_{r' \ge r} \mathrm{Prec}(r')$$

Precision Interpolation

We show this smoothing by accessing the y values (precision) from the "performance( )" object. Below is the R code for a function that plots the precision-recall curve with interpolation of the precision values. Line 3 loads the "ROCR" package if it is not already attached, and lines 4 and 5 are familiar, as they create the "prediction( )" and "performance( )" objects.

R Code:
 1 > # precision smoothing
 2 > PRcurve <- function(preds, trues, ...) {
 3 +   require(ROCR, quietly=T)
 4 +   pd <- prediction(preds, trues)
 5 +   pf <- performance(pd, "prec","rec")
 6 +   pf@y.values <- lapply(pf@y.values, function(x) rev(cummax(rev(x))))
 7 +   plot(pf, ...)
 8 + }
 9 > 
10 > # plot both graphics side by side
11 > par(mfrow=c(1,2))
12 > plot(perf)
13 > PRcurve(ROCR.simple$predictions, ROCR.simple$labels)
14 > 

Line 6 accesses "pf@y.values", takes the cumulative maximum of the reverse-ordered y values, reverses them back, and replaces the original y values. The "cummax( )" function returns the running maximum of a vector: each element is the largest value seen up to that index. By reversing the y values first, we start from recall 1.0 and carry the highest precision seen so far back towards lower recall values, then restore the original order afterwards. Starting from line 11, we plot the two curves side by side (1 row, 2 columns) by plotting the previous 'sawtooth' "perf" object and calling the "PRcurve" function.



Figure 2. 'Sawtooth' and Interpolated Precision Precision-Recall Curves

Note how the smooth curve keeps the highest achievable precision at each recall value. Also, in line 6 we use "lapply( )" to apply the function over the list of y values, so the "PRcurve" function also works when multiple sets of predictions are passed to it.
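To see what the interpolation step on line 6 does in isolation, here is a small stand-alone illustration with a made-up precision vector (not taken from "ROCR.simple"):

# precision values ordered by increasing recall (made-up numbers)
prec <- c(1.00, 0.50, 0.67, 0.75, 0.60, 0.55)
rev(cummax(rev(prec)))
# [1] 1.00 0.75 0.75 0.75 0.60 0.55
# each value is now the highest precision achievable at that recall or beyond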



Fraudulent Transactions



So how does this apply to inspecting frauds? We know from the start that we aim to maximize recall, capturing as many of the total frauds as possible with our limited inspection effort. Last post we talked about transaction outliers and their outlier scores. With the outlier scores, we can establish a threshold above which a transaction is predicted as fraudulent, which gives us predicted labels. We can then run the already-inspected transactions through the model and compare the predicted fraud status with the inspected status to compute precision and recall.
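As a hedged sketch of that thresholding step (the scores, labels, and cutoff below are invented, not the actual outlier scores from Part 3):

# hypothetical outlier scores and inspection outcomes for six reports
scores <- c(0.91, 0.15, 0.66, 0.08, 0.87, 0.33)
trues  <- c(1, 0, 1, 0, 0, 0)            # 1 = confirmed fraud on inspection
threshold <- 0.5
preds <- as.integer(scores >= threshold) # predict fraud above the cutoff
sum(preds & trues) / sum(preds)          # precision at this threshold
sum(preds & trues) / sum(trues)          # recall at this threshold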



Lift Charts


Lift charts place more emphasis on recall, and so are more applicable to evaluating our fraud models. These charts are different in that the X axis shows the rate of positive predictions (RPP), and the Y axis displays the recall value divided by the RPP. The RPP is the number of positive class predictions divided by the total number of test cases. For example, if 5 cases are predicted as frauds (though not all might be true frauds) out of 20 test cases, then the RPP is 5/20 = 0.25.
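Using the numbers from that example plus a hypothetical recall value, the lift chart quantities can be computed directly:

# hypothetical counts matching the example above
n_test     <- 20             # total test cases
n_pos_pred <- 5              # cases predicted as fraud
rpp  <- n_pos_pred / n_test  # rate of positive predictions = 0.25
recall <- 0.6                # assume 60% of all frauds captured at this cutoff
lift <- recall / rpp         # Y-axis value of the lift chart = 2.4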

Again, the "ROCR" package includes the necessary functions to obtain the lift chart metrics.   We still use the "$predictions" and "$labels", but now use "lift" and "rpp" as our Y and X axes in the "performance( )" function in line 3. For the cumulative recall function, "CRchart", we use the same format as the "PRcurve" function but similarly substitute in "rec" for recall and "rpp" for rate of positive predictions.

R Code:
 1 > # lift chart
 2 > pred <- prediction(ROCR.simple$predictions, ROCR.simple$labels)
 3 > perf <- performance(pred, "lift", "rpp")
 4 > par(mfrow=c(1,2))
 5 > plot(perf, main="Lift Chart")
 6 > 
 7 > # cumulative chart with rate of positive predictions
 8 > CRchart <- function( preds, trues, ...) {
 9 +   require(ROCR, quietly=T)
10 +   pd <- prediction(preds, trues)
11 +   pf <- performance(pd, "rec", "rpp")
12 +   plot(pf, ...)
13 + }
14 > CRchart(ROCR.simple$predictions, ROCR.simple$labels,
15 +         main='Cumulative Recall Chart')
16 > 

This yields two graphs side by side:



Figure 3. Lift Chart and Cumulative Recall Chart

The closer the cumulative recall curve is to the top left corner of the graph, the better the model. While lift charts compare recall with RPP, the cumulative recall versus RPP curve is more applicable here: it shows how recall changes with increasing inspection effort, that is, as more of the cases are tested for fraud. At an RPP of 1.0, all cases have been tested, so all frauds have been captured and recall is 1.0.


Next we will explore the normalized distance to typical price to obtain the outlier score, with which we will set the threshold and evaluate the recall of the various models. Stay tuned, folks!


Thanks for reading,

Wayne
@beyondvalence
LinkedIn

Fraudulent Transactions Series:
1. Predicting Fraudulent Transactions in R: Part 1. Transactions
2. Predicting Fraudulent Transactions in R: Part 2. Handling Missing Data
3. Predicting Fraudulent Transactions in R: Part 3. Handling Transaction Outliers
4. Predicting Fraudulent Transactions in R: Part 4. Model Criterion, Precision & Recall
5. Predicting Fraudulent Transactions in R: Part 5. Normalized Distance to Typical Price

Tuesday, January 27, 2015

Predicting Fraudulent Transactions in R: Part 3. Handling Transaction Outliers


Hello Readers,

Welcome back! In this post we continue our case study on Detecting Fraudulent Transactions. In Part 2, we cleaned the 'Sales' dataset by imputing missing values, but some wild 'Quantity' and 'Value' figures are still present in the data. In this post we address handling these pesky outliers, though not exactly in terms of large numbers. Think smaller.

Ready? Start R and let's go.

(This is a series from Luis Torgo's Data Mining with R book.)

Those Pesky Outliers


When we first think of outliers, we might picture transactions with products sold in enormous quantities, or sold at eyebrow-raising prices. $1,000 per unit? Is that legit or a typo? But keeping in mind our goal of predicting the status of transactions from a training set of inspected transactions, we need to be aware of the low number of manually inspected transactions ("ok" n=14,462, "fraud" n=1,270, vs "unkn" n=385,414) available to build a model for the un-inspected transactions (over 96% of the data). It also turns out that there are 985 products with fewer than 20 transactions! Therefore, we should keep an eye on those products with few transactions, as well as on products with an average number of transactions but crazy 'Quantity' or 'Value' figures. These counts can be verified directly, as sketched below.
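A hedged sketch of how those counts can be checked, assuming the cleaned 'sales' data frame from Part 2 is available:

# assumes salesClean.rdata from Part 2 contains the 'sales' data frame
load('salesClean.rdata')
table(sales$Insp)            # inspection labels: counts of ok / unkn / fraud
sum(table(sales$Prod) < 20)  # number of products with fewer than 20 transactions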

Robust Measures

Load the 'salesClean.rdata' that we created in Part 2. Our plan of attack uses robust measures that are not sensitive to outliers, such as the median and interquartile range (IQR) of the unit price for each product. Be sure to use only the transactions that are not labeled 'fraud', or else our typical values would not be accurate. Then we implement the 'tapply()' function to grab the median and quartile values from 'boxplot.stats()', and transform the list output from 'tapply()' into a matrix after we 'unlist()' it.

Robust Measures:
> load('salesClean.rdata')
> attach(sales)
> # find non-fraud indexes
> notF <- which(Insp != 'fraud')
> # median and IQR of unit-price for each product
> ms <- tapply(Uprice[notF], list(Prod=Prod[notF]), 
+              function(x) {
+                bp <- boxplot.stats(x)$stats
+                # returns median and iqr = 75% - 25% percentiles
+                c(median=bp[3],iqr=bp[4]-bp[2])
+              })
> # transforms into a matrix
> # with median unit price value for each product, along with iqr
> ms <- matrix(unlist(ms),
+              length(ms), 2, byrow=T, 
+              dimnames=list(names(ms), c('median', 'iqr')))
> head(ms)
      median      iqr
p1 11.346154 8.575599
p2 10.877863 5.609731
p3 10.000000 4.809092
p4  9.911243 5.998530
p5 10.957447 7.136601
p6 13.223684 6.685185
> 

Now we have a matrix of the median and IQR of the unit price for each product, computed from non-fraud transactions. Note the printed median and IQR values for the first 6 products. Below, we plot the median against the IQR twice: unscaled on the left, and in log scale on the right to accommodate the extreme values. Because the log plot on the right has a more even spread of values, we plot its points in grey and later overlay them with the points of products having fewer than 20 transactions.


Plotting Measures:
> # plot the median and unit price for each product
> # also show log of plot to accommodate the extreme values
> par(mfrow= c(1,2))
> plot(ms[,1], ms[,2], xlab="Median", ylab="IQR", main="")
> # now grey log points
> plot(ms[,1], ms[,2], xlab="Median", ylab="IQR", main="",
+      col="grey", "log"="xy")
>
> # which are the few products less than 20 transactions
> smalls <- which(table(Prod) < 20)
> # draw the black +'s over the grey log points
> # to highlight products with few transactions
> points(log(ms[smalls,1]), log(ms[smalls,2]), pch="+")

Below, we see why only the log plot on the right needed to be grey. The extreme value at the top-right of the unscaled plot squeezes the rest of the points into a dense cluster, so we would not have been able to see the overlaid points anyway. With the log transform, however, the distribution of the robust unit-price measures is visualized much more clearly. The black points, representing products with fewer than 20 transactions, fall in line with the rest of the products.


Figure 1. Plots of Median and IQR Product Values in 2 Scales
We observe that many products have similar medians and IQR spreads, which is good for those products with fewer transactions. Even so, we need to examine the variables of those products with few transactions more closely.

Products with Few Transactions

We might assess outlier transactions from products with few transactions by grouping them together with products that have similar distributions, in order to gain statistical significance. From the above plots, there are many products with similar medians and IQR spreads. That is why we earlier obtained the median and IQR values for each product, both resistant to extreme outlier values. We assume that the unit price of a product is approximately normally distributed around its median value, with a spread given by the IQR.


However, not all of the scarce products have distributions similar to those of other products, so for those we will have more difficulty determining with statistical confidence whether their transactions are fraudulent.

Following the robust-measures theme, we proceed with a robust nonparametric test of the similarity of distributions, the Kolmogorov-Smirnov test. The K-S test evaluates the null hypothesis that two samples come from the same distribution, and its statistic is the maximum difference between the two empirical cumulative distribution functions. A quick toy illustration is sketched below, before we apply it to the products.
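Here is a toy illustration of 'ks.test()' on simulated data (not the sales data), just to show how the statistic and p-value behave:

# two samples from the same distribution should give a large p-value
set.seed(42)
x <- rnorm(50, mean = 10, sd = 2)
y <- rnorm(50, mean = 10, sd = 2)
ks.test(x, y)   # high p-value: no evidence the distributions differ
z <- rnorm(50, mean = 20, sd = 2)
ks.test(x, z)   # tiny p-value: the distributions clearly differ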


Running the K-S Test



Using the 'ms' matrix object with the median and IQR values we generated earlier, we scale 'ms' to 'dms', then create 'smalls', which contains the integer indexes of our scarce products. The 'prods' list contains the unit prices of the transactions of each product. We finish the preparation by generating an empty NA matrix, 'similar', with one row per scarce product and 7 columns for the resulting statistics.

Beginning in line 23 we loop through the scarce products. The first operation, on line 24, is crucial: we use 'scale()' to subtract the ith scarce product's scaled median and IQR from every row of 'dms', so that the ith scarce product is compared against all the other products. Line 26 squares those differences, sums them across the two columns via matrix multiplication, and takes the square root, producing a distance value for each product. In line 29 we run the K-S test with 'ks.test()', comparing the unit prices of the ith scarce product with those of the product having the second-smallest distance. Why the second smallest? Because the smallest distance is zero, the product compared with itself; remember that the scaling in 'd' is applied to all the products. We record the results for the ith product in the ith row of the 'similar' matrix we created, which has one row per scarce product: the integer index of the most similar product in the first column, the K-S statistic and p-value in the next two, followed by the unscaled median and IQR values of the scarce product and of its most similar product.

Finding Similar Products:
 1 > # K-S Test for similar transactions ####
 2 > # for products with few transactions
 3 > # create matrix table with similar product
 4 > # from the Kolmogorov-Smirnov test
 5 > # includes KS statistic, pvalue
 6 > # and median and iqr for both product and similar Product
 7 > 
 8 > dms <- scale(ms)
 9 > smalls <- which(table(Prod) < 20)
10 > prods <- tapply(sales$Uprice, sales$Prod, list)
11 > # create empty similar matrix, with one row per scarce product and 7 variables
12 > similar <- matrix(NA, length(smalls), 7,
13 +   # add row and column names
14 +   dimnames=list(names(smalls),
15 +   c("Similar", "ks.stat", "ks.p",
16 +   "medianP", "iqrP",
17 +   "medianSim", "iqrSim")))
18 > # iterate through each row in the median iqr Uprice matrix
19 > # with few transaction products only.
20 > # scale all products to the current row to find the most similar product
21 > # using the KS test over all products.
22 > # does not iterate through all products in dms.
23 > for(i in seq(along=smalls)) {
24 +   d <- scale(dms, dms[smalls[i], ], FALSE)
25 +   # removes negatives through squaring, then sums via matrix multiplication
26 +   d <- sqrt(drop(d^2 %*% rep(1, ncol(d))))
27 +   # ks test of current product and next best product with lowest difference
28 +   # because best product is itself
29 +   stat <- ks.test(prods[[smalls[i]]], prods[[order(d)[2]]])
30 +   # add results to the similar matrix:
31 +   # similar product, KS statistic, KS pval,
32 +   # product values, similar product values (median, iqr)
33 +   similar[i, ] <- c(order(d)[2], stat$statistic, stat$p.value,
34 +   ms[smalls[i], ], ms[order(d)[2], ])
35 + }
36 > 
37 > head(similar) # so first product p8's similar product is p2829
38     Similar   ks.stat       ks.p  medianP      iqrP medianSim    iqrSim
39 p8     2827 0.4339623 0.06470603 3.850211 0.7282168  3.868306 0.7938557
40 p18     213 0.2568922 0.25815859 5.187266 8.0359968  5.274884 7.8894149
41 p38    1044 0.3650794 0.11308315 5.490758 6.4162095  5.651818 6.3248073
42 p39    1540 0.2258065 0.70914769 7.986486 1.6425959  8.080694 1.7668724
43 p40    3971 0.3333333 0.13892028 9.674797 1.6104511  9.668854 1.6520147
44 p47    1387 0.3125000 0.48540576 2.504092 2.5625835  2.413498 2.6402087
45 > # confirm using levels
46 > levels(Prod)[similar[1,1]]
47 [1] "p2829"

We see from the results in 'similar' that the most similar product to the first scarce product, 'p8', is the product with level index 2827, which corresponds to 'p2829'; the quick check at line 46 confirms the matched product ID. The comparison is nearly statistically significant (p-value = 0.0647), meaning there is a fairly small probability of observing such a difference by chance: even though 'p2829' is the most similar product available, it comes close to rejecting the K-S null hypothesis that the two samples come from the same distribution. We want the similar products to come from distributions like those of the scarce products, so that we can group them together and overcome the small number of transactions. So we are looking for p-values at the other extreme, close to 1, which indicate that the two distributions are (nearly) equal.

Evaluation of Valid Products


Now that we have the 'similar' matrix, we can examine how many scarce products have a match whose distribution is sufficiently similar. Here we require a K-S p-value of at least 0.9, because we want to be as close as possible to the null hypothesis of equal distributions. So below, we take the sum of the row-wise logical comparison.

Viable Products:

> # check how many products have unit price distribution that is
> # significantly similar with 90% CI with KS value being 0.9 or greater
> sum(similar[, "ks.p"] >= 0.9) # 117
[1] 117
> dim(similar)
[1] 985   7
> save(similar, file="similarProducts.Rdata")
> 

Observe that 117 products have sufficiently similar matches, out of the 985 products with fewer than 20 transactions. So while we can match almost 12% of the scarce products with another product, there are still 868 scarce products left unmatched. However, do not despair! Though there are quite a few we were not able to match, we did capture 117 products for which we would otherwise have had trouble obtaining accurate fraud predictions later. (Remember to save the 'similar' object!)


Summary

Here we explored tracking non-traditional outliers, namely products with few transactions, for the future purpose of more accurate fraud prediction. We used the Kolmogorov-Smirnov test to find, for each scarce product, the product with the most similar unit-price distribution, using the robust measures of median and IQR. The results show that 117 of the 985 scarce products have a sufficiently similar match (K-S p-value of at least 0.9).

Next we will discuss the evaluation criteria of classifying fraudulent transactions from a predictive model. How do we know how well the model performed? How does one model compare to another? Stay tuned!

Thanks for reading,

Wayne
@beyondvalence
LinkedIn

Fraudulent Transactions Series:
1. Predicting Fraudulent Transactions in R: Part 1. Transactions
2. Predicting Fraudulent Transactions in R: Part 2. Handling Missing Data
3. Predicting Fraudulent Transactions in R: Part 3. Handling Transaction Outliers
4. Predicting Fraudulent Transactions in R: Part 4. Model Criterion, Precision & Recall
5. Predicting Fraudulent Transactions in R: Part 5. Normalized Distance to Typical Price

Monday, September 29, 2014

Handling Large Data with SQL Server


Hello Readers,

Big Data is all the rage these days. On this blog we have mainly used R as our analysis program of choice (sorry, Python) to examine, model, and predict on the data. R is optimal for data with a few hundred thousand rows or fewer; dealing with larger data sets of millions of rows or more usually slows R down to a crawl. (Remember the neural network post?) Another deciding factor in computation speed is your computer setup.

Because there are millions of rows in big data sets, we need only sample a few hundred thousand of them into R. But where will we store the millions of rows prior to sampling? Easy, into a data warehouse made for million plus row data sets, a SQL Server database. 

We have not dealt with hundreds of millions or billions of rows, so resist the urge to mention the Hadoop distributed file system. Or should I say not yet? Anyways, here we will load million plus row U.S. Census data sets in preparation for R sampling.


Getting Census Data


Specifically we concentrate on two data sets, U.S. Census American Community Survey 2011, where we can find nationwide population and housing data. You can download the data via ftp here, or choose 2011 ACS 1-year PUMS from the ACS site. The ACS encompasses social, economic, demographic, and housing data. A screenshot of the page is shown below:


Figure 1. American Community Survey (Census) Webpage
After clicking the red highlighted link, 2011 ACS 1-year PUMS, we choose from CSV or SAS file format. Take your pick, I chose CSV for importing into SQL Server.


Figure 2. Choose File Format - CSV
Next, the site gives us choices whether to download the entire dataset for the U.S. or individual states. Of course, download both the Population and Housing data for the entire U.S. Be warned, these two files take up nearly 1 GB of hard drive space, so be patient! Download accelerator anyone?


Figure 3. Download Population and Housing Data Files
After we have the zip files downloaded, unzip them. Note that the housing data 'ss11hus*' and population data 'ss11pus*' have parts a and b.


Figure 4. Unzipped Files
You can also find the helpful README file for the data at the ACS site as well.


Importing into SQL Server


Now for the crucial part, where we import the data into a database system, SQL Server, instead of reading the CSV files into R directly. After connecting to your server, select the database into which you want to import the CSV files. Right-click it and, under Tasks, choose Import Files.

Figure 5. Import Files Option
Next we specify the data source. The CSV files are not Excel or Access files; rather, they are a Flat File Source. So select the appropriate option, and select a CSV file from the unzipped file location. On the left, under General, you can see a Columns view option, which lets you preview the file contents prior to importing.

Figure 6. Select Flat File Source
Figure 7 shows the Columns for our preview, so we can ensure the proper headers and values are read by the SQL Server Wizard.

Figure 7. Column Previews
Even though we right-clicked a certain database, SQL Server still asks us in which database we want to import the data. You can change your mind, if you clicked wrong earlier, so take your pick.

Figure 8. Table Destination
Here we encounter the Source Table Option window. We can alter the schema and name of table with Edit Mapping at the bottom. The current schema and name are set as 'dbo' and 'ss11pusa', respectively. If you want to use a different schema than the default, now is the time to impose the change.

Figure 9. Select Source Tables
SQL Server gives us a last overview option before we start the importing process. Ready? Click Finish.

Figure 10. Last Check
And here we go! Look at the Import Wizard chug along as it imports over a million rows!

Figure 11. Import Progressing
When the process is finished, we are rewarded with... a new table with lots of rows. Take a look below. 1.6 million? Not bad. Now you just have to import the remaining 3 files for a complete set!

Figure 12. Execution Successful
Remember to refresh the Server instance for it to reflect the new table changes. Now you can query away in a new query window to explore the data, or you can keep the Server on and jump into R to establish a database connection using ODBC (open database connectivity).

Figure 13. Locating Imported Table
I thought this post made up for the previous few posts' lack of pictures. Was the compensation count this time sufficient (13)? Here we learned how to import a large data set, split across 4 files, into a database management system, SQL Server. The next post will feature the next step in accessing Big Data in R: the database connection. There we shall use SQL Server as a data warehouse and R as the analytics tool. Stay tuned.
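As a tiny preview, here is a hedged sketch of what that connection might look like from R with the RODBC package; the DSN name, credentials, and row limit are placeholders, so adjust them to your own setup:

# query the imported table from R over ODBC
library(RODBC)
ch  <- odbcConnect("census", uid = "username", pwd = "password")  # "census" is a placeholder DSN
acs <- sqlQuery(ch, "SELECT TOP 1000 * FROM dbo.ss11pusa")        # pull a sample of rows
dim(acs)                                                          # check what came back
odbcClose(ch)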

Thanks for reading,

Wayne
@beyondvalence
LinkedIn

Wednesday, September 24, 2014

Natural Language Processing in Python: Part 4. Frequency Distributions, Word Selections, & Collocations


Hello Readers,

We settle into Part 4 of the Text Analysis Series with Python by examining frequency distributions, word selections, and collocations. As a refresher, collocations are sequences of words that occur together unusually often, given the individual word frequencies.

A specific case would be a pair of words forming a collocation as a bigram, such as 'red wine'; a trigram sequence would be three words long. 'red' and 'wine' occur together quite often, as opposed to the generic 'the wine', while 'maroon wine' would make little sense. That demonstrates how collocations are resistant to substitution: only those particular words carry that meaning, so those specific words are used together, and therefore occur together quite frequently.

Before we jump into collocations, let's start with frequency distributions. In the 'nltk' module in Python, we have a number of text corpora available for analysis. Load it, and get yourself comfortable for the ride.


Frequency Distributions


So we did some counting in a previous NLP post. We will count in this post as well, but with a different process. We aim to quantify each unique token in a given text corpus. How many times does 'government' occur in the Inaugural Address corpus? What about 'lol' in the Internet Chat corpus? Use the 'FreqDist()' method! As we see below, 'FreqDist()' takes the text and creates a frequency distribution for the unique tokens, and they aren't all words. They can be periods, parentheses, commas, etc.


With the summary of 'fdist1', we observe that it has 19,317 samples (unique tokens) and 260,819 outcomes (total tokens, the length of the text).

Code:
Python 2.7.8 (default, Jun 30 2014, 16:03:49) [MSC v.1500 32 bit (Intel)]
Type "copyright", "credits" or "license" for more information.

IPython 2.1.0 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.
%guiref   -> A brief reference about the graphical user interface.

In [1]: from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908

#####
# frequency distributions

# create FreqDist object
In [3]: fdist1 = FreqDist(text1)

# summary of FreqDist object, 19317 unique tokens with 260,819 total tokens
In [4]: fdist1
Out[4]: <FreqDist with 19317 samples and 260819 outcomes>

# retrieve set of tokens
In [5]: vocab1 = fdist1.keys()

# display first 10 of set
In [6]: vocab1[:10]
Out[6]:
[u'funereal',
 u'unscientific',
 u'divinely',
 u'foul',
 u'four',
 u'gag',
 u'prefix',
 u'woods',
 u'clotted',
 u'Duck']

# display number of occurrences for 'whale' token
In [7]: fdist1['whale']
Out[7]: 906

# plot first 20 terms
In [8]: fdist1.plot(20, cumulative=False)

Accessing the '.keys()' method, we can assign the unique tokens to 'vocab1' and look at the first 10 elements. We see words such as 'funereal', 'unscientific', and 'divinely'. We can also look up the count of a word: take 'whale', since text1 is Moby Dick, and we see that text1 has 906 occurrences of 'whale'.


Lastly, the 'FreqDist' object has a plot function. We specify the number of terms, and whether the plot is cumulative or not, and Python returns Figure 1.


Figure 1. Frequency Distribution of 20 Terms in Text1

Notice how 'four' has the highest count in this sample of 20 terms, with 'hanging' coming in second. The words you see with barely any count have a count of 1: they occur in the text corpus only once! These special terms are called hapaxes (singular: hapax). nltk has a special function just for identifying hapaxes. You guessed it, '.hapaxes()'!


Code:
######
# hapaxes are words that occur only once

# display first 10 occurrences
In [8]: fdist1.hapaxes()[:10]
Out[8]:
[u'funereal',
 u'unscientific',
 u'prefix',
 u'plaudits',
 u'woody',
 u'disobeying',
 u'Westers',
 u'DRYDEN',
 u'Untried',
 u'superficially']
 
# display total number of single occurrences 
In [9]: len(fdist1.hapaxes())
Out[9]: 9002

The first 10 hapaxes can be sliced from the function's result, and we can also see how many single-occurrence terms are in text1 by passing the result to the 'len()' method. We see that text1 has 9,002 terms which occur only once.



Word Selections


Now we will take advantage of Python's ability to loop through iterable objects to select words with certain attributes, such as word length and number of occurrences in a corpus. This way we can search for defining words which capture the essence of a corpus, or track trends in word usage. We can tune the required word count or word length to our needs in different circumstances.


Word Length
We can create an if condition to accept only words with more than 15 characters in the set of unique tokens from text1, as shown below. Some long words which occur are: 'CIRCUMNAVIGATION', 'Physiognomically', and 'apprehensiveness'.


Code:
# using set theory to select words
# lengthy words of 15 characters or more
# {w | w E V & P(w)}
# [w for w in V if P(w)]
# the set of all w such that w is an element of V (vocab) and has property P

# get set of vocab in text1
In [10]: V = set(text1)

# iterate through V, grabbing each word with character length greater than 15
In [11]: long_words = [w for w in V if len(w) > 15]

# display sorted first 10 lengthy words
In [12]: sorted(long_words)[:10]
Out[12]:
[u'CIRCUMNAVIGATION',
 u'Physiognomically',
 u'apprehensiveness',
 u'cannibalistically',
 u'characteristically',
 u'circumnavigating',
 u'circumnavigation',
 u'circumnavigations',
 u'comprehensiveness',
 u'hermaphroditical']

#####
# looking at internet long word patterns
# more than 15 characters

# check which text number is internet chat, #5 
In [13]: texts()
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908

# create unique vocab set
In [14]: vocab = set(text5)

# iterate through vocab for words greater than 15 characters in length
In [15]: long_chat_words = [word for word in vocab if len(word) > 15]

# display first 10 sorted
In [16]: sorted(long_chat_words)[:10]
Out[16]:
[u'!!!!!!!!!!!!!!!!',
 u'!!!!!!!!!!!!!!!!!!!!!!',
 u'!!!!!!!!!!!!!!!!!!!!!!!',
 u'!!!!!!!!!!!!!!!!!!!!!!!!!!!',
 u'!!!!!!!!!!!!!!!!!!!!!!!!!!!!',
 u'!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!',
 u'#talkcity_adults',
 u'(((((((((((((((((',
 u'((((((((((((((((((',
 u'((((((((((((((((((((']

# display 101st to 110th sorted, no results 
In [17]: sorted(long_chat_words)[100:111]
Out[17]: []

# index from last for last 10
# observe exaggerated chat patterns
In [18]: sorted(long_chat_words)[-10:]
Out[18]:
[u'oooooooooooooonnnnnnnnnnnneeeeeeeeeeeeeeesssssssss',
 u'raaaaaaaaaaaaaaaaaaaaaaaaaaaaa',
 u'tatatatnanaantatat',
 u'weeeeeeeeeeeeeeee',
 u'weeeeeeeeeeeeeeeeeeeeeeeeed',
 u'wheeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee',
 u'woooooooooaaaahhhhhhhhhhhh',
 u'wooooooooooooohoooooooooooooooo',
 u'www.Wunderground.com',
 u'yuuuuuuuuuuuummmmmmmmmmmm']

Furthermore, we use the Internet Chat corpus, text5, to examine some words with long length. The first few are simply exclamation points, while the last few are 'overspelled' for dramatic effect.


Word Length and Frequency
Besides word length, we can consider another attribute when selecting words from a corpus: word frequency. Even if a word is long-winded, if it occurs more than a few times it could be an important word in the corpus. So we also include the count from the frequency distribution.

Looking at the Internet Chat corpus (text5) again, we select for the word length to be more than 8, and a frequency higher than 5. Keep in mind we have to create the frequency distribution from text5 first, and use the set of text5 as the iterable variable.

Code:
#####
# which words typify a text? 
# the long words or the single occurrences? (hapaxes)
# what about frequently occurring long words?

# display text names
In [19]: texts()
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908

# create FreqDist object for text5
In [20]: fdist5 = FreqDist(text5)

# sort words iterated through set of text5
# having character length more than 8, and occurring more than 5 times
In [25]: selected_words = sorted([w for w in set(text5) if len(w) > 8 and fdist5[w] > 5])

# display words selected on minimum word length and occurrence
In [26]: selected_words
Out[26]:
[u'#14-19teens',
 u'#talkcity_adults',
 u'((((((((((',
 u')))))))))))',
 u')))))))))))))',
 u'.........',
 u'Compliments',
 u'cute.-ass',
 u'everybody',
 u'everything',
 u'listening',
 u'seriously',
 u'something',
 u'sometimes']

The results vary in content, as Internet Chat is not censored, as you might discover if you delve deeper into the text.



Bigrams & Collocations


Here we arrive at the word pairs and special word pairs. The 'bigrams()' method creates pairings of words as it iterates through the text, combining adjacent words. Collocations pull those word pairs which exist together unusually frequently, and you might find that they have a particular meaning when seen together and are not descriptive when apart.

The bigrams for 'more is said than done' are shown below. Note that the adjacent words are paired. For collocations, use the '.collocations()' method on a text corpus to retrieve the list of collocation terms. Looking at the text4 Inaugural Address corpus, 'United States', 'fellow citizens', 'four years', and the like are mentioned in those combinations more often than not. 'Indian tribes' and 'Chief Justice' are also terms which occur together.

Code:
# collocation : sequence of words that occur together unusually often
# ex: 'red wine', as opposed to 'the wine'
# bigram : word pairs 

# create bigram list
In [34]: bigram1 = list(bigrams(['more','is','said','than','done']))

# display bigram, note pairs of words
In [35]: bigram1
Out[35]: [('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]

# collocation for text4
# bigrams with words that occur together more frequently 
# than expected based on frequency of individual words
In [37]: text4.collocations()
Building collocations list
United States; fellow citizens; four years; years ago; Federal
Government; General Government; American people; Vice President; Old
World; Almighty God; Fellow citizens; Chief Magistrate; Chief Justice;
God bless; every citizen; Indian tribes; public debt; one another;
foreign nations; political parties

# collocations for personals corpus
In [38]: texts()
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908

# display personals collocations
In [39]: text8.collocations()
Building collocations list
would like; medium build; social drinker; quiet nights; non smoker;
long term; age open; Would like; easy going; financially secure; fun
times; similar interests; Age open; weekends away; poss rship; well
presented; never married; single mum; permanent relationship; slim
build

For the personals corpus, text8, we encounter word pairs in personal advertisements online. Likely and logical word combinations such as 'medium build', 'social drinker', 'quiet nights', easy going', 'financially secure', and 'permanent relationship' are paired together due to the nature of personality and lifestyle description. Simply put, those words go together, like peanut butter and jelly, although I am a 'peanut butter will do' person. The pair 'peanut butter' would be a collocation in a baking recipe or snacks corpus.


Wow, there goes another post! I know there was only one graph this time, so thanks for making it this far. Here we learned about frequency distributions, different ways to select words from a text corpus, and word selection tools to create bigrams and collocations in Python. This is just the beginning, and there is lots more text analysis to cover! So stay tuned.



Thanks for reading,

Wayne
@beyondvalence
LinkedIn

Text Analysis Series:
1. Natural Language Processing in Python: Part 1. Texts
2. Natural Language Processing in Python: Part 2. Counting Vocabulary
3. Natural Language Processing in Python: Part 3. Indexing Lists
4. Natural Language Processing in Python: Part 4. Frequency Distributions, Word Selections, & Collocations