## Tuesday, January 27, 2015

### Predicting Fraudulent Transactions in R: Part 3. Handling Transaction Outliers

Welcome back! In this post we continue our case study on Detecting Fraudulent Transactions. In Part 2, we cleaned/imputed the 'Sales' dataset of missing values, but we still have some wild 'Quantity' and 'Value' variables which are present in the data. In this post we will address handling these pesky outliers, but not exactly in terms of large numbers. Think smaller.

Ready? Start R and let's go.

(This is a series from Luis Torgo's  Data Mining with R book.)

### Those Pesky Outliers

When we first think of outliers, we might picture transactions with products being sold at enormous quantities, or products being sold with eye-brow-raising prices. \$1,000 for each unit? Is that legit or a typo? But keeping in mind our goal of predicting transactions from a training set with inspected transactions, we need to be aware of the low number of manually inspected transactions ("ok" n=14,462, "fraud" n=1,270, vs "unkn" n=385,414) we will use to create a model to predict the un-inspected transactions (over 96% of the transactions). It turns out that there are 985 products with less than 20 transactions! Therefore, we should be keeping an eye on those products with few transactions, as well as concentrating on products with an average number of transactions, but with crazy 'Quantity' or 'Value' counts.

Robust Measures

Load the 'salesClean.rdata' that we created in Part 2. Our plan of attack will use robust measures that are not sensitive to outliers, such as the median and inner-quartile range (IQR) of the unit price for each product, using the 'tapply()' function. Be sure to use only the transactions that are not labeled 'fraud', or else our averages would not be accurate. The we implement the 'tapply()' function to grab the median and quartile variables from 'boxplot.stats()'. Then we take the list output from 'tapply()' and transform it into a matrix, after we 'unlist()' the output.

Robust Measures:
 ``` 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25``` ```> load('salesClean.rdata') > attach(sales) > # find non-fraud indexes > notF <- which(Insp != 'fraud') > # median and IQR of unit-price for each product > ms <- tapply(Uprice[notF], list(Prod=Prod[notF]), + function(x) { + bp <- boxplot.stats(x)\$stats + # returns median and iqr = 75% - 25% percentiles + c(median=bp[3],iqr=bp[4]-bp[2]) + }) > # transforms into a matrix > # with median unit price value for each product, along with iqr > ms <- matrix(unlist(ms), + length(ms), 2, byrow=T, + dimnames=list(names(ms), c('median', 'iqr'))) > head(ms) median iqr p1 11.346154 8.575599 p2 10.877863 5.609731 p3 10.000000 4.809092 p4 9.911243 5.998530 p5 10.957447 7.136601 p6 13.223684 6.685185 > ```

Now we have a matrix of the median and IQR unit price for each product using non-fraud transactions. Note the printed median and IQR values for the first 6 products. Below, we plot the median and IQR unit prices twice, unscaled on the left, and in log scale on the right to accommodate the extreme values. Because the log plot on the right will have a more even distribution of values, we will plot them in grey, and later overlay them with unit price points of products with less than 20 transactions.

Plotting Measures:
 ``` 1 2 3 4 5 6 7 8 9 10 11 12 13``` ```> # plot the median and unit price for each product > # also show log of plot to accomodate the extreme values > par(mfrow= c(1,2)) > plot(ms[,1], ms[,2], xlab="Median", ylab="IQR", main="") > # now grey log points > plot(ms[,1], ms[,2], xlab="Median", ylab="IQR", main="", + col="grey", "log"="xy") > > # which are the few products less than 20 transactions > smalls <- which(table(Prod) < 20) > # draw the black +'s over the grey log points #> to highlight products with few transactions > points(log(ms[smalls,1]), log(ms[smalls,2]), pch="+") ```

Below, we see why the only log plot on the right needed to be grey. The extreme value on top-right of the unscaled plot completely relegates the rest of the points in a dense cluster.  We would not have been able to see the overlay points anyways. However, with the log transform, the distribution of robust measures of unit prices is more clearly visualized. The black points representing products with less than 20 transactions fall in line with the rest of products.

 Figure 1. Plots of Median and IQR Product Values in 2 Scales
We observe that many products have similar medians and IQR spreads, which is good for those products with fewer transactions. Even so, we need to examine the variables of those products with few transactions more closely.

Products with Few Transactions

We might assess outlier transactions from products with few transactions by grouping them together with products with similar distributions to gain statistical significance. From the above plots, there are many products with similar medians and IQR spreads. That is why we earlier obtained the median and IQR values for each product, both resistant to extreme outlier values. We assume that the unit-price for a product is normally distributed around the middle median value, with spread of the IQR.

However not all of the fewer transaction products have distributions similar to other 'normal' products. So we will have more difficulty determining whether those transactions are fraudulent or not with statistical confidence.

Following the robust measures theme, we proceed with a robust nonparametric test of the similarity of distributions, the Kolmogorov-Smirnov test. The K-S test informs us of the null hypothesis that two samples come from the same distribution. The statistic we obtain from the K-S test gives us the maximum difference between two empirical cumulative distribution functions.

### Running the K-S Test

Using the 'ms' matrix object with the median and IQR values we generated earlier, we scale 'ms' to 'dms', then create 'smalls' which contains the integer index of our scarce products. The 'prods' list contains the unit price for each transaction for each product. Then we finish preparation by generating an empty NA matrix with rows the length of 'smalls', and 7 columns for the resulting statistics.

Beginning in line 23 we start to loop through the scarce products. The first operation on line 24 is crucial. We use 'scale()' to subtract the entire 'dms' by each row with the median and IQR values for the ith scarce product. That way each scarce product is compared to all the other products. Line 26 removes any negative values and multiplies the matrices together to form a 'difference' value, and in line 29 we run the K-S test with 'ks.test()',  comparing the unit-prices of that ith scarce product to the second smallest difference value. Why the second smallest? Because the smallest difference would be zero, or the same product median and IQR values as itself- remember the scaling in 'd' applies to all the products. To record the results of the ith product, we store it in the ith row, in the 'similar' matrix we created, which happens to have the number of rows as number of scarce products. We store the integer of similar product in the first column, the K-S statistic in the second, and then storing the unscaled median and IQR values for the scarce product and the similar product.

Finding Similar Products:
 ``` 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47``` ```> # K-S Test for similar transactions #### > # for products with few transactions > # create matrix table with similar product > # from the Kolmogorov-Smirnov test > # includes KS statistic, pvalue > # and median and iqr for both product and similar Product > > dms <- scale(ms) > smalls <- which(table(Prod) < 20) > prods <- tapply(sales\$Uprice, sales\$Prod, list) > # create emtpy similar matrix, fit for few product #rows and 7 variables > similar <- matrix(NA, length(smalls), 7, + # add row and column names + dimnames=list(names(smalls), + c("Similar", "ks.stat", "ks.p", + "medianP", "iqrP", + "medianSim", "iqrSim"))) > # iterate through each row in the median iqr Uprice matrix > # with few transaction products only. > # scale to all to current row to find most similar product > # using the KS test in all products. > # does not iterate through all products in dms. > for(i in seq(along=smalls)) { + d <- scale(dms, dms[smalls[i], ], FALSE) + # removes negatives through matrix multiplication + d <- sqrt(drop(d^2 %*% rep(1, ncol(d)))) + # ks test of current product and next best product with lowest difference + # because best product is itself + stat <- ks.test(prods[[smalls[i]]], prods[[order(d)[2]]]) + # add results to the similar matrix: + # similar product, KS statistic, KS pval, + # product values, similar product values (median, iqr) + similar[i, ] <- c(order(d)[2], stat\$statistic, stat\$p.value, + ms[smalls[i], ], ms[order(d)[2], ]) + } > > head(similar) # so first product p8's similar product is p2829 Similar ks.stat ks.p medianP iqrP medianSim iqrSim p8 2827 0.4339623 0.06470603 3.850211 0.7282168 3.868306 0.7938557 p18 213 0.2568922 0.25815859 5.187266 8.0359968 5.274884 7.8894149 p38 1044 0.3650794 0.11308315 5.490758 6.4162095 5.651818 6.3248073 p39 1540 0.2258065 0.70914769 7.986486 1.6425959 8.080694 1.7668724 p40 3971 0.3333333 0.13892028 9.674797 1.6104511 9.668854 1.6520147 p47 1387 0.3125000 0.48540576 2.504092 2.5625835 2.413498 2.6402087 > # confirm using levels > levels(Prod)[similar[1,1]] [1] "p2829" ```

We see from the results in 'similar' that the most similar product to first scarce product, 'p8', is 'p2827'. A quick check at line 46 verifies the matched product ID. It is nearly statistically significant (small p-value = 0.0647), which implies that there is a small probability of the differences occurring by chance- that while 'p2827' is the most similar, it can nearly reject the null hypothesis of the K-S test where the two samples come from the same distribution. We want the similar products to be from similar distributions to that of scarce products so we can group them together as similar products in order to overcome their few number of transactions.
So we are looking for p-values at the other extreme, close to 1, which indicates that the two distributions are (nearly) equal.

Evaluation of Valid Products

Now that we have the 'similar' matrix, we can examine how many products have similar products sufficiently similar in distribution within a 90% confidence interval. Therefore, we rely on a p-value of 0.9, because we want to be as close to the null of equal distributions as possible. So below, we take the sum of the row-wise logic operation.

Viable Products:

 ```1 2 3 4 5 6 7 8``` ```> # check how many products have unit price distribution that is > # significantly similar with 90% CI with KS value being 0.9 or greater > sum(similar[, "ks.p"] >= 0.9) # 117 [1] 117 > dim(similar) [1] 985 7 > save(similar, file="similarProducts.Rdata") > ```

Observe 117 products have sufficiently similar products, out of 985 products with less than 20 transactions. So while we can match almost 12% of the scarce products with another product, there are still 868 products with less than 20 transactions. However do not despair! Though we have quite a few we were not able to match, we did capture 117 products which we would have otherwise had trouble obtaining accurate fraudulent predictions later. (Remember to save the 'similar' object!)

Summary

Here we explored tracking non-traditional outliers in terms of products few number of transactions for the future purpose of more accurate fraud prediction. We used the Kolmogorov-Smirnov test to evaluate and find other products with most similar distributions using robust measures of median and IQR. The results show that 117 of the 985 scarce products have similar transactions with 90% CI.

Next we will discuss the evaluation criteria of classifying fraudulent transactions from a predictive model. How do we know how well the model performed? How does one model compare to another? Stay tuned!