Pages

Saturday, April 18, 2015

Predicting Fraudulent Transactions in R: Part 5. Normalized Distance to Typical Price


Hello Readers,

Here we delve into a quick evaluation of quality metrics of the rankings of unlabeled reports. Previously we relied on labeled reports for the true fraud/non-fraud label for evaluation, and including the training set of reports, there will undoubtedly be unlabeled reports ranked in the near the top. Should those unlabeled reports be in the top in terms of likelihood for being fraudulent? That is where we use our predictive model to determine the classification of those unlabeled reports based on the the labeled reports.

One method compares the unit price of that report with that of the average unit price of transactions of that particular product. If the difference between the unit price for that transaction and the typical unit price transaction for that product is large- then that report is likely to belong in the top possible fraudulent transactions predicted by the model. Here we provide a method in evaluating the model performance.

(This is a series from Luis Torgo's  Data Mining with R book.)


Normalized Distance to Typical Price


To calculate the normalized distance to typical price (NDTP) for a specific product of a transaction (report), we take the difference of the unit price of that transaction and the overall unit median price of that product, and divide it by the inter-quartile range of that product's prices. 

Below is the code for calculating the NDTP. Starting on line 2, we define the 'avgNDTP' function to accept the arguments 'toInsp', 'train', and 'stats'. 'toInsp' contains the transaction data you want to inspect; 'train' contains raw transactions to obtain the median unit prices; 'stats' contains the pre-calculated typical unit prices for each product. This way, we can calling the 'avgNDTP' function repeatedly, sometimes with the pre-calculated stats is more efficient than calculating the stats each time. 

Normalized Distance to Typical Price Code:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
> # normalized distances of typical prices ####
> avgNDTP <- function( toInsp, train, stats ) {
+   if(missing(train) && missing(stats))
+     stop('Provide either the training data or the product stats')
+   if(missing(stats)) {
+     notF <- which(train$Insp != 'fraud')
+     stats <- tapply(train$Uprice[notF],
+                     list(Prod=train$Prod[notF]),
+                     function(x) {
+                       bp <- boxplot.stats(x)$stats
+                       c(median=bp[3],iqr=bp[4]-bp[2])
+                     })
+     stats <- matrix(unlist(stats),
+                     length(stats), 2, byrow=T,
+                     dimnames=list(names(stats), c('median','iqr')))
+     stats[which(stats[,'iqr']==0),'iqr'] <-
+       stats[which(stats[,'iqr']==0),'median']
+   }
+   
+   mdtp <- mean(abs(toInsp$Uprice-stats[toInsp$Prod,'median']) /
+                  stats[toInsp$Prod,'iqr'])
+   return(mdtp)
+ }

The if statements check to see if sufficient arguments are present, and if the 'stats' input needs to be calculated. If so, line 6 begins to calculate the typical unit price for each product from those inspected transactions which are not labeled fraudulent. From there, we calculate the median and IQR starting on line 9. After unlisting and transforming the results, 'stats' into a matrix on line 13, we replace those with IQR=0 with their median value, because we cannot divide by zero.


On line 20, we create the variable 'mdtp' to hold the normalized distances from the typical price using the unit price from 'toInsp' and the median and IQR from the 'stats' provided or the 'stats' generated from the 'train' argument.

We will use this NDTP metric to evaluate and compare future predictive models, and seeing how well they identify fraudulent transactions given the few inspected transactions we have in our dataset.


So stay tuned for the upcoming predictive model posts!



As always, thanks for reading,

Wayne
@beyondvalence
LinkedIn

Fraudulent Transactions Series:
1. Predicting Fraudulent Transactions in R: Part 1. Transactions
2. Predicting Fraudulent Transactions in R: Part 2. Handling Missing Data
3. Predicting Fraudulent Transactions in R: Part 3. Handling Transaction Outliers
4. Predicting Fraudulent Transactions in R: Part 4. Model Criterion, Precision & Recall
5. Predicting Fraudulent Transactions in R: Part 5. Normalized Distance to Typical Price
.

Saturday, March 28, 2015

Predicting Fraudulent Transactions in R: Part 4. Model Criterion, Precision & Recall


Hello Readers,

Before we start creating classification models to predict fraudulent transactions, we need to understand how to evaluate model performance. Not only do we want to know if the particular transaction is predicted as fraudulent or not, we also want to know the confidence in that particular prediction (such as a probability of fraud).

To do this, we will use two measures, precision and recall. Beginning where we left off at the end of Part 3, start R and let us get started.

(This is a series from Luis Torgo's  Data Mining with R book.)


Precision & Recall


Because we have a limited number of transactions we can inspect, k transactions out of 401146 total transactions, we can determine the precision and recall of the model on those k transactions. Those k top transactions contain the most likely fraud candidates predicted by the model.

Precision:
Within those k transactions, the proportion of frauds among them; also known as positive predictive value

Recall:
Among the fraudulent transactions, the proportion of frauds in the k transactions; also known as sensitivity

There are trade offs when making decisions concerning the level of recall and precision. It could be easy to use a large number of k transactions to capture all of the fraudulent transactions, but that would result in low precision as there will be a larger proportion of 'good' transactions in the k top transactions. Given the feasibility of inspecting k transactions, we want to maximize our resources in the transaction inspection activities. So when we inspect x transactions in t hours, and we manage to capture all the frauds in those x transactions, we will have accomplished our task. Even if a large proportion of those x transactions were valid transactions, we would value high recall in this situation.



Performance Models


We now turn to evaluation of different inspection effort levels using different visual graphics. Some statistical models might be suited towards precision or towards recall. We would rank the class of interest (fraud) and interpolate the precision and recall values at different effort limits to determine those precision and recall values. Different effort limits will yield different precision and recall values, thus creating a visual performance model where we can see the optimal effort limits of high/low precision and recall values.


Using package "ROCR" that contains multiple functions for evaluating binary classifiers (yes/no, fraud/no fraud variables), including functions that calculate precision and recall, which we will then use to plot the performance model. Below, we load the "ROCR" library and the "ROCR.simple" data example with predictions in R. Next, in line 4, with the "prediction( )" function, we create "pred" using the "$predictions" and the true values in "$labels". Afterwards, we pass the "pred" prediction object to "performance( )" to obtain the precision and recall metrics. In line 5, we specify the "prec" and "rec" (precision and recall respectively) as arguments. 

R Code:

1
2
3
4
5
6
7
> # precision and recall functions
> library(ROCR)
> data(ROCR.simple)
> pred <- prediction(ROCR.simple$predictions, ROCR.simple$labels)
> perf <- performance(pred, "prec", "rec")
> plot(perf)
> 

Plotting the "perf" object in line 6 will plot the below 'sawtooth' graphic.

Figure 1. Sawtooth Precision and Recall

As you can observe, at high levels of precision, the recall is not always high, and the same for high levels of recall. Note the axis limits of (0.5, 1.0) precision on the Y, and the (0.0, 1.0) range for recall on the X axis. This means that for 100% recall, where all the positives are captured, the precision (proportion of positives in the sample results) falls below 0.5,  due to capturing non-positives as well. Whereas for 100% precision, the recall falls to near zero, as not that many of the total positives are captured by the test. However, note the top right point, where the precision and recall both reach above 0.80- that is a good trade-off point if we were to take into account both precision and recall.

Evidently, the 'sawtooth' graph is not smooth at all. There is a method where we use the highest achieved precision level as we increase in recall value to smooth the curve. This interpolation of the precision takes the highest value of precision for a certain value of recall. As the value of recall increases, the maximum precision changes (decreases) to the maximum precision of precision values at greater recall values. This is described by the formula below, where r is the recall value, and r' are those values greater than r:


Precision Interpolation

We show this smoothing by accessing the y values (precision) from the "performance( )" object.  Below is the R code for a function that plots the precision-recall curve with interpolation of the precision values. Line 3 checks if the "ROCR" package is loaded, and lines 4 and 5 are familiar, as they create the "prediction( )" and "performance( )" objects.

R Code:
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
> # precision smoothing
> PRcurve <- function(preds, trues, ...) {
+   require(ROCR, quietly=T)
+   pd <- prediction(preds, trues)
+   pf <- performance(pd, "prec","rec")
+   pf@y.values <- lapply(pf@y.values, function(x) rev(cummax(rev(x))))
+   plot(pf, ...)
+ }
> 
> # plot both graphics side by side
> par(mfrow=c(1,2))
> plot(perf)
> PRcurve(ROCR.simple$predictions, ROCR.simple$labels)
> 

Line 6 accesses the "pf@y.values" and takes the cumulative maximum value of the reverse ordered y values, reverses them back, and replaces the original y values. The "cummax( )" function returns a vector with the maximum value in order by index (first, second, third value). By reversing the y values, we start decreasing at recall 1.0, and take the maximum value until we get another higher precision value, and reverse the order afterwards. Starting from line 11, we plot the two curves side by side (1 row, 2 columns) by calling the previous 'sawtooth' "perf" object and the "PRcurve" function.



Figure 2. 'Sawtooth' and Interpolated Precision Precision-Recall Curves

Note how the smooth curve has all the highest possible precision values at increasing recall values. Also, in line 6, we use "lapply( )" to apply the function over the y values so we can pass multiple sets of y values with this "PRcurve" function.



Fraudulent Transactions



So how does this apply to inspecting frauds? We know from the beginning that we aim to increase our recall metric to capture as many total frauds as possible for optimal efficiency. Last post we talked about transaction outliers and their outlier score. With the outlier scores, we can set a limit by establishing a threshold for the outlier score where a transaction is predicted as a fraud or non-fraudulent. That way, we have the predicted values. Additionally, we can compare the inspected results when we run the inspected values through the model and compare them with the predicted fraud status and the inspected status for precision and recall values.



Lift Charts


Lift charts provide more emphasis on recall, and will be more applicable to evaluating fraudulent model transactions. These charts are different as they include the rate of positive predictions (RPP) on the X axis and the Y axis displays the recall value divided by the RPP. The RPP is the number of positive class predictions divided by the total number of test cases. For example, if there are 5 frauds predicted (though not all might be true frauds) out of 20 test cases, then the RPP is 0.25.

Again, the "ROCR" package includes the necessary functions to obtain the lift chart metrics.   We still use the "$predictions" and "$labels", but now use "lift" and "rpp" as our Y and X axes in the "performance( )" function in line 3. For the cumulative recall function, "CRchart", we use the same format as the "PRcurve" function but similarly substitute in "rec" for recall and "rpp" for rate of positive predictions.

R Code:
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
> # lift chart
> pred <- prediction(ROCR.simple$predictions, ROCR.simple$labels)
> perf <- performance(pred, "lift", "rpp")
> par(mfrow=c(1,2))
> plot(perf, main="Lift Chart")
> 
> # cumulative chart with rate of positive predictions
> CRchart <- function( preds, trues, ...) {
+   require(ROCR, quietly=T)
+   pd <- prediction(preds, trues)
+   pf <- performance(pd, "rec", "rpp")
+   plot(pf, ...)
+ }
> CRchart(ROCR.simple$predictions, ROCR.simple$labels,
+         main='Cumulative Recall Chart')
> 

This yields two graphs side by side:



Figure 3. Lift Chart and Cumulative Recall Chart

The more close the cumulative recall curve is to the top left corner of the graph, the better the indication from the model. While lift charts contain the comparison of recall and RPP, the cumulative recall and RPP curve applies more here. It shows the recall value with changing inspection effort, where the number of cases are tested and determined to be frauds. At 1.0, all cases were tested, and therefore, all frauds were captured, showing 1.0 recall.


Next we will be exploring normalized distance to obtain the outlier score with which we will determine the threshold and evaluate the many models' recall metrics. Stay tuned, folks!


Thanks for reading,

Wayne
@beyondvalence
LinkedIn

Fraudulent Transactions Series:
1. Predicting Fraudulent Transactions in R: Part 1. Transactions
2. Predicting Fraudulent Transactions in R: Part 2. Handling Missing Data
3. Predicting Fraudulent Transactions in R: Part 3. Handling Transaction Outliers
4. Predicting Fraudulent Transactions in R: Part 4. Model Criterion, Precision & Recall
5. Predicting Fraudulent Transactions in R: Part 5. Normalized Distance to Typical Price
.

Tuesday, January 27, 2015

Predicting Fraudulent Transactions in R: Part 3. Handling Transaction Outliers


Hello Readers,

Welcome back! In this post we continue our case study on Detecting Fraudulent Transactions. In Part 2, we cleaned/imputed the 'Sales' dataset of missing values, but we still have some wild 'Quantity' and 'Value' variables which are present in the data. In this post we will address handling these pesky outliers, but not exactly in terms of large numbers. Think smaller.

Ready? Start R and let's go.

(This is a series from Luis Torgo's  Data Mining with R book.)

Those Pesky Outliers


When we first think of outliers, we might picture transactions with products being sold at enormous quantities, or products being sold with eye-brow-raising prices. $1,000 for each unit? Is that legit or a typo? But keeping in mind our goal of predicting transactions from a training set with inspected transactions, we need to be aware of the low number of manually inspected transactions ("ok" n=14,462, "fraud" n=1,270, vs "unkn" n=385,414) we will use to create a model to predict the un-inspected transactions (over 96% of the transactions). It turns out that there are 985 products with less than 20 transactions! Therefore, we should be keeping an eye on those products with few transactions, as well as concentrating on products with an average number of transactions, but with crazy 'Quantity' or 'Value' counts. 

Robust Measures

Load the 'salesClean.rdata' that we created in Part 2. Our plan of attack will use robust measures that are not sensitive to outliers, such as the median and inner-quartile range (IQR) of the unit price for each product, using the 'tapply()' function. Be sure to use only the transactions that are not labeled 'fraud', or else our averages would not be accurate. The we implement the 'tapply()' function to grab the median and quartile variables from 'boxplot.stats()'. Then we take the list output from 'tapply()' and transform it into a matrix, after we 'unlist()' the output.

Robust Measures:
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
> load('salesClean.rdata')
> attach(sales)
> # find non-fraud indexes
> notF <- which(Insp != 'fraud')
> # median and IQR of unit-price for each product
> ms <- tapply(Uprice[notF], list(Prod=Prod[notF]), 
+              function(x) {
+                bp <- boxplot.stats(x)$stats
+                # returns median and iqr = 75% - 25% percentiles
+                c(median=bp[3],iqr=bp[4]-bp[2])
+              })
> # transforms into a matrix
> # with median unit price value for each product, along with iqr
> ms <- matrix(unlist(ms),
+              length(ms), 2, byrow=T, 
+              dimnames=list(names(ms), c('median', 'iqr')))
> head(ms)
      median      iqr
p1 11.346154 8.575599
p2 10.877863 5.609731
p3 10.000000 4.809092
p4  9.911243 5.998530
p5 10.957447 7.136601
p6 13.223684 6.685185
> 

Now we have a matrix of the median and IQR unit price for each product using non-fraud transactions. Note the printed median and IQR values for the first 6 products. Below, we plot the median and IQR unit prices twice, unscaled on the left, and in log scale on the right to accommodate the extreme values. Because the log plot on the right will have a more even distribution of values, we will plot them in grey, and later overlay them with unit price points of products with less than 20 transactions.


Plotting Measures:
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
> # plot the median and unit price for each product
> # also show log of plot to accomodate the extreme values
> par(mfrow= c(1,2))
> plot(ms[,1], ms[,2], xlab="Median", ylab="IQR", main="")
> # now grey log points
> plot(ms[,1], ms[,2], xlab="Median", ylab="IQR", main="",
+      col="grey", "log"="xy")
>
> # which are the few products less than 20 transactions
> smalls <- which(table(Prod) < 20)
> # draw the black +'s over the grey log points 
#>  to highlight products with few transactions
> points(log(ms[smalls,1]), log(ms[smalls,2]), pch="+")

Below, we see why the only log plot on the right needed to be grey. The extreme value on top-right of the unscaled plot completely relegates the rest of the points in a dense cluster.  We would not have been able to see the overlay points anyways. However, with the log transform, the distribution of robust measures of unit prices is more clearly visualized. The black points representing products with less than 20 transactions fall in line with the rest of products.


Figure 1. Plots of Median and IQR Product Values in 2 Scales
We observe that many products have similar medians and IQR spreads, which is good for those products with fewer transactions. Even so, we need to examine the variables of those products with few transactions more closely.

Products with Few Transactions

We might assess outlier transactions from products with few transactions by grouping them together with products with similar distributions to gain statistical significance. From the above plots, there are many products with similar medians and IQR spreads. That is why we earlier obtained the median and IQR values for each product, both resistant to extreme outlier values. We assume that the unit-price for a product is normally distributed around the middle median value, with spread of the IQR.


However not all of the fewer transaction products have distributions similar to other 'normal' products. So we will have more difficulty determining whether those transactions are fraudulent or not with statistical confidence. 

Following the robust measures theme, we proceed with a robust nonparametric test of the similarity of distributions, the Kolmogorov-Smirnov test. The K-S test informs us of the null hypothesis that two samples come from the same distribution. The statistic we obtain from the K-S test gives us the maximum difference between two empirical cumulative distribution functions.


Running the K-S Test



Using the 'ms' matrix object with the median and IQR values we generated earlier, we scale 'ms' to 'dms', then create 'smalls' which contains the integer index of our scarce products. The 'prods' list contains the unit price for each transaction for each product. Then we finish preparation by generating an empty NA matrix with rows the length of 'smalls', and 7 columns for the resulting statistics.

Beginning in line 23 we start to loop through the scarce products. The first operation on line 24 is crucial. We use 'scale()' to subtract the entire 'dms' by each row with the median and IQR values for the ith scarce product. That way each scarce product is compared to all the other products. Line 26 removes any negative values and multiplies the matrices together to form a 'difference' value, and in line 29 we run the K-S test with 'ks.test()',  comparing the unit-prices of that ith scarce product to the second smallest difference value. Why the second smallest? Because the smallest difference would be zero, or the same product median and IQR values as itself- remember the scaling in 'd' applies to all the products. To record the results of the ith product, we store it in the ith row, in the 'similar' matrix we created, which happens to have the number of rows as number of scarce products. We store the integer of similar product in the first column, the K-S statistic in the second, and then storing the unscaled median and IQR values for the scarce product and the similar product.

Finding Similar Products:
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
> # K-S Test for similar transactions ####
> # for products with few transactions
> # create matrix table with similar product
> # from the Kolmogorov-Smirnov test
> # includes KS statistic, pvalue
> # and median and iqr for both product and similar Product
> 
> dms <- scale(ms)
> smalls <- which(table(Prod) < 20)
> prods <- tapply(sales$Uprice, sales$Prod, list)
> # create emtpy similar matrix, fit for few product #rows and 7 variables
> similar <- matrix(NA, length(smalls), 7,
+   # add row and column names
+   dimnames=list(names(smalls),
+   c("Similar", "ks.stat", "ks.p",
+   "medianP", "iqrP",
+   "medianSim", "iqrSim")))
> # iterate through each row in the median iqr Uprice matrix
> # with few transaction products only.
> # scale to all to current row to find most similar product
> # using the KS test in all products.
> # does not iterate through all products in dms.
> for(i in seq(along=smalls)) {
+   d <- scale(dms, dms[smalls[i], ], FALSE)
+   # removes negatives through matrix multiplication
+   d <- sqrt(drop(d^2 %*% rep(1, ncol(d))))
+   # ks test of current product and next best product with lowest difference
+   # because best product is itself
+   stat <- ks.test(prods[[smalls[i]]], prods[[order(d)[2]]])
+   # add results to the similar matrix:
+   # similar product, KS statistic, KS pval,
+   # product values, similar product values (median, iqr)
+   similar[i, ] <- c(order(d)[2], stat$statistic, stat$p.value,
+   ms[smalls[i], ], ms[order(d)[2], ])
+ }
> 
> head(similar) # so first product p8's similar product is p2829
    Similar   ks.stat       ks.p  medianP      iqrP medianSim    iqrSim
p8     2827 0.4339623 0.06470603 3.850211 0.7282168  3.868306 0.7938557
p18     213 0.2568922 0.25815859 5.187266 8.0359968  5.274884 7.8894149
p38    1044 0.3650794 0.11308315 5.490758 6.4162095  5.651818 6.3248073
p39    1540 0.2258065 0.70914769 7.986486 1.6425959  8.080694 1.7668724
p40    3971 0.3333333 0.13892028 9.674797 1.6104511  9.668854 1.6520147
p47    1387 0.3125000 0.48540576 2.504092 2.5625835  2.413498 2.6402087
> # confirm using levels
> levels(Prod)[similar[1,1]]
[1] "p2829"

We see from the results in 'similar' that the most similar product to first scarce product, 'p8', is 'p2827'. A quick check at line 46 verifies the matched product ID. It is nearly statistically significant (small p-value = 0.0647), which implies that there is a small probability of the differences occurring by chance- that while 'p2827' is the most similar, it can nearly reject the null hypothesis of the K-S test where the two samples come from the same distribution. We want the similar products to be from similar distributions to that of scarce products so we can group them together as similar products in order to overcome their few number of transactions. 
So we are looking for p-values at the other extreme, close to 1, which indicates that the two distributions are (nearly) equal.

Evaluation of Valid Products


Now that we have the 'similar' matrix, we can examine how many products have similar products sufficiently similar in distribution within a 90% confidence interval. Therefore, we rely on a p-value of 0.9, because we want to be as close to the null of equal distributions as possible. So below, we take the sum of the row-wise logic operation.

Viable Products:

1
2
3
4
5
6
7
8
> # check how many products have unit price distribution that is
> # significantly similar with 90% CI with KS value being 0.9 or greater
> sum(similar[, "ks.p"] >= 0.9) # 117
[1] 117
> dim(similar)
[1] 985   7
> save(similar, file="similarProducts.Rdata")
> 

Observe 117 products have sufficiently similar products, out of 985 products with less than 20 transactions. So while we can match almost 12% of the scarce products with another product, there are still 868 products with less than 20 transactions. However do not despair! Though we have quite a few we were not able to match, we did capture 117 products which we would have otherwise had trouble obtaining accurate fraudulent predictions later. (Remember to save the 'similar' object!)


Summary

Here we explored tracking non-traditional outliers in terms of products few number of transactions for the future purpose of more accurate fraud prediction. We used the Kolmogorov-Smirnov test to evaluate and find other products with most similar distributions using robust measures of median and IQR. The results show that 117 of the 985 scarce products have similar transactions with 90% CI. 

Next we will discuss the evaluation criteria of classifying fraudulent transactions from a predictive model. How do we know how well the model performed? How does one model compare to another? Stay tuned!

Thanks for reading,

Wayne
@beyondvalence
LinkedIn

Fraudulent Transactions Series:
1. Predicting Fraudulent Transactions in R: Part 1. Transactions
2. Predicting Fraudulent Transactions in R: Part 2. Handling Missing Data
3. Predicting Fraudulent Transactions in R: Part 3. Handling Transaction Outliers
4. Predicting Fraudulent Transactions in R: Part 4. Model Criterion, Precision & Recall
5. Predicting Fraudulent Transactions in R: Part 5. Normalized Distance to Typical Price
.