
Saturday, April 18, 2015

Predicting Fraudulent Transactions in R: Part 5. Normalized Distance to Typical Price


Hello Readers,

Here we delve into a quick evaluation of a quality metric for the rankings of unlabeled reports. Previously we relied on labeled reports for the true fraud/non-fraud status when evaluating a ranking, but even when the training set of reports is included, there will undoubtedly be unlabeled reports ranked near the top. Should those unlabeled reports really sit near the top in terms of likelihood of being fraudulent? That is where we use our predictive model to determine the classification of those unlabeled reports based on the labeled reports.

One method compares the unit price in a report with the typical unit price of transactions for that particular product. If the difference between the unit price of the transaction and the typical unit price for that product is large, then that report likely belongs among the top possible fraudulent transactions predicted by the model. This gives us a way to evaluate model performance.

(This is a series from Luis Torgo's  Data Mining with R book.)


Normalized Distance to Typical Price


To calculate the normalized distance to typical price (NDTP) for the product of a given transaction (report), we take the difference between the unit price of that transaction and the overall median unit price of that product, and divide it by the inter-quartile range (IQR) of that product's unit prices.
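In symbols, for a transaction with unit price u of product p, this works out to (writing it out from the description above):

NDTP_p(u) = |u - median_p| / IQR_p

where median_p and IQR_p are the median and inter-quartile range of the unit prices of product p, computed from the non-fraud transactions.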

Below is the code for calculating the NDTP. Starting on line 2, we define the 'avgNDTP' function to accept the arguments 'toInsp', 'train', and 'stats'. 'toInsp' contains the transaction data you want to inspect; 'train' contains raw transactions from which to obtain the typical unit prices; 'stats' contains pre-calculated typical unit prices for each product. This way, when calling the 'avgNDTP' function repeatedly, passing the pre-calculated stats is more efficient than recalculating them on every call.

Normalized Distance to Typical Price Code:

 1 > # normalized distances of typical prices ####
 2 > avgNDTP <- function( toInsp, train, stats ) {
 3 +   if(missing(train) && missing(stats))
 4 +     stop('Provide either the training data or the product stats')
 5 +   if(missing(stats)) {
 6 +     notF <- which(train$Insp != 'fraud')
 7 +     stats <- tapply(train$Uprice[notF],
 8 +                     list(Prod=train$Prod[notF]),
 9 +                     function(x) {
10 +                       bp <- boxplot.stats(x)$stats
11 +                       c(median=bp[3],iqr=bp[4]-bp[2])
12 +                     })
13 +     stats <- matrix(unlist(stats),
14 +                     length(stats), 2, byrow=T,
15 +                     dimnames=list(names(stats), c('median','iqr')))
16 +     stats[which(stats[,'iqr']==0),'iqr'] <-
17 +       stats[which(stats[,'iqr']==0),'median']
18 +   }
19 +   
20 +   mdtp <- mean(abs(toInsp$Uprice-stats[toInsp$Prod,'median']) /
21 +                  stats[toInsp$Prod,'iqr'])
22 +   return(mdtp)
23 + }

The if statements check whether sufficient arguments are present and whether the 'stats' input needs to be calculated. If so, line 6 begins calculating the typical unit price for each product from the training transactions that are not labeled fraudulent. From there, we calculate the median and IQR starting on line 9. After unlisting and transforming the results into the 'stats' matrix on line 13, we replace any IQR equal to 0 with the corresponding median value, because we cannot divide by zero.


On line 20, we create the variable 'mdtp' to hold the mean of the normalized distances to the typical price, using the unit price from 'toInsp' and the median and IQR from the 'stats' provided or from the 'stats' generated from the 'train' argument.

We will use this NDTP metric to evaluate and compare future predictive models, seeing how well they identify fraudulent transactions given the few inspected transactions we have in our dataset.
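As a usage sketch (the object names below are hypothetical stand-ins, not from the book's code), a call might look like this once we have a set of reports to evaluate and either the training data or pre-computed product stats:

> # hypothetical objects: 'topReports' holds the highest-ranked reports,
> # 'trainReports' the labeled training transactions (with Uprice, Prod, Insp)
> avgNDTP(toInsp=topReports, train=trainReports)
> # or, reusing pre-computed per-product medians and IQRs:
> # avgNDTP(toInsp=topReports, stats=productStats)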


So stay tuned for the upcoming predictive model posts!



As always, thanks for reading,

Wayne
@beyondvalence
LinkedIn

Fraudulent Transactions Series:
1. Predicting Fraudulent Transactions in R: Part 1. Transactions
2. Predicting Fraudulent Transactions in R: Part 2. Handling Missing Data
3. Predicting Fraudulent Transactions in R: Part 3. Handling Transaction Outliers
4. Predicting Fraudulent Transactions in R: Part 4. Model Criterion, Precision & Recall
5. Predicting Fraudulent Transactions in R: Part 5. Normalized Distance to Typical Price
.

Saturday, March 28, 2015

Predicting Fraudulent Transactions in R: Part 4. Model Criterion, Precision & Recall


Hello Readers,

Before we start creating classification models to predict fraudulent transactions, we need to understand how to evaluate model performance. Not only do we want to know if the particular transaction is predicted as fraudulent or not, we also want to know the confidence in that particular prediction (such as a probability of fraud).

To do this, we will use two measures, precision and recall. Beginning where we left off at the end of Part 3, start R and let us get started.

(This is a series from Luis Torgo's  Data Mining with R book.)


Precision & Recall


Because we have a limited number of transactions we can inspect, k transactions out of 401146 total transactions, we can determine the precision and recall of the model on those k transactions. Those k top transactions contain the most likely fraud candidates predicted by the model.

Precision:
Within those k transactions, the proportion of frauds among them; also known as positive predictive value

Recall:
Among all fraudulent transactions, the proportion captured within the k inspected transactions; also known as sensitivity (see the small example below)
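As a small illustration of these two definitions (the vector below is made up, not from the sales data), suppose we rank ten cases by the model and inspect the top k = 4:

> # hypothetical true labels (1 = fraud), ordered by the model's ranking
> trues <- c(1, 0, 1, 1, 0, 0, 1, 0, 1, 0)
> k <- 4
> inspected <- trues[1:k]
> sum(inspected)/k           # precision: frauds among the k inspected
[1] 0.75
> sum(inspected)/sum(trues)  # recall: share of all frauds captured
[1] 0.6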

There are trade-offs when deciding on the level of recall and precision. It would be easy to use a large k to capture all of the fraudulent transactions, but that would result in low precision, as a larger proportion of 'good' transactions would be mixed into the top k. Since only a limited number of transactions can feasibly be inspected, we want to make the most of the inspection effort. If we inspect x transactions in t hours and manage to capture all the frauds among those x transactions, we will have accomplished our task, even if a large proportion of those x transactions turn out to be valid; in this situation we value high recall.



Performance Models


We now turn to evaluating different inspection effort levels using visual graphics. Some statistical models might be better suited towards precision, others towards recall. We rank the cases by the class of interest (fraud) and compute, by interpolation, the precision and recall values at different effort limits. Different effort limits yield different precision and recall values, creating a visual performance model where we can see which effort limits give high or low precision and recall.


Using package "ROCR" that contains multiple functions for evaluating binary classifiers (yes/no, fraud/no fraud variables), including functions that calculate precision and recall, which we will then use to plot the performance model. Below, we load the "ROCR" library and the "ROCR.simple" data example with predictions in R. Next, in line 4, with the "prediction( )" function, we create "pred" using the "$predictions" and the true values in "$labels". Afterwards, we pass the "pred" prediction object to "performance( )" to obtain the precision and recall metrics. In line 5, we specify the "prec" and "rec" (precision and recall respectively) as arguments. 

R Code:

1 > # precision and recall functions
2 > library(ROCR)
3 > data(ROCR.simple)
4 > pred <- prediction(ROCR.simple$predictions, ROCR.simple$labels)
5 > perf <- performance(pred, "prec", "rec")
6 > plot(perf)
7 > 

Plotting the "perf" object in line 6 will plot the below 'sawtooth' graphic.

Figure 1. Sawtooth Precision and Recall

As you can observe, at high levels of precision the recall is not always high, and vice versa. Note the axis limits: (0.5, 1.0) for precision on the Y axis and (0.0, 1.0) for recall on the X axis. This means that at 100% recall, where all the positives are captured, the precision (the proportion of true positives among the flagged cases) falls below 0.5, because non-positives are captured as well. Whereas at 100% precision, the recall falls to near zero, as few of the total positives are captured by the test. However, note the top-right point, where precision and recall both reach above 0.80: that is a good trade-off point if we take both precision and recall into account.

Evidently, the 'sawtooth' graph is not smooth at all. One way to smooth it is to interpolate the precision: for each recall value, we use the highest precision achieved at that recall level or any greater one. As recall increases, this interpolated precision can only decrease. This is described by the formula below, where r is the recall value and r' ranges over the recall values greater than or equal to r:


Prec_interp(r) = max { Prec(r') : r' >= r }

Precision Interpolation

We show this smoothing by accessing the y values (precision) from the "performance( )" object. Below is the R code for a function that plots the precision-recall curve with interpolation of the precision values. Line 3 loads the "ROCR" package quietly if it is not already loaded, and lines 4 and 5 are familiar, as they create the "prediction( )" and "performance( )" objects.

R Code:
 1 > # precision smoothing
 2 > PRcurve <- function(preds, trues, ...) {
 3 +   require(ROCR, quietly=T)
 4 +   pd <- prediction(preds, trues)
 5 +   pf <- performance(pd, "prec","rec")
 6 +   pf@y.values <- lapply(pf@y.values, function(x) rev(cummax(rev(x))))
 7 +   plot(pf, ...)
 8 + }
 9 > 
10 > # plot both graphics side by side
11 > par(mfrow=c(1,2))
12 > plot(perf)
13 > PRcurve(ROCR.simple$predictions, ROCR.simple$labels)
14 > 

Line 6 accesses "pf@y.values", takes the cumulative maximum of the reverse-ordered y values, reverses them back, and replaces the original y values. The "cummax( )" function returns the running (cumulative) maximum of a vector, element by element. By reversing the y values, we start from recall 1.0, keep the maximum precision seen so far as recall decreases, and then reverse the result back into the original order. Starting from line 11, we plot the two curves side by side (1 row, 2 columns) by calling the previous 'sawtooth' "perf" object and the "PRcurve" function.



Figure 2. 'Sawtooth' and Interpolated Precision Precision-Recall Curves

Note how the smooth curve has all the highest possible precision values at increasing recall values. Also, in line 6, we use "lapply( )" to apply the function over the y values so we can pass multiple sets of y values with this "PRcurve" function.
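To see what the interpolation on line 6 does, here is a minimal sketch on a made-up vector of precision values (ordered by increasing recall):

> # hypothetical precision values at increasing recall
> prec <- c(1.0, 0.5, 0.8, 0.6, 0.7, 0.4)
> rev(cummax(rev(prec))) # interpolated precision
[1] 1.0 0.8 0.8 0.7 0.7 0.4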



Fraudulent Transactions



So how does this apply to inspecting frauds? We know from the beginning that we aim to increase our recall metric to capture as many of the total frauds as possible. In the last post we talked about transaction outliers and their outlier score. With the outlier scores, we can establish a threshold above which a transaction is predicted to be a fraud, and below which it is not. That gives us the predicted values. Then, for the transactions that were actually inspected, we can compare the predicted fraud status with the inspected status to compute precision and recall.



Lift Charts


Lift charts provide more emphasis on recall, and will be more applicable to evaluating fraudulent model transactions. These charts are different as they include the rate of positive predictions (RPP) on the X axis and the Y axis displays the recall value divided by the RPP. The RPP is the number of positive class predictions divided by the total number of test cases. For example, if there are 5 frauds predicted (though not all might be true frauds) out of 20 test cases, then the RPP is 0.25.
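To make these definitions concrete with made-up numbers (not from the sales data): suppose 20 test cases contain 10 true frauds, the model flags 5 cases, and 4 of the flagged cases are actual frauds.

> # hypothetical numbers, not from the sales data
> rpp <- 5/20     # rate of positive predictions
> recall <- 4/10  # frauds captured among all true frauds
> recall/rpp      # lift, the Y axis of the lift chart
[1] 1.6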

Again, the "ROCR" package includes the necessary functions to obtain the lift chart metrics.   We still use the "$predictions" and "$labels", but now use "lift" and "rpp" as our Y and X axes in the "performance( )" function in line 3. For the cumulative recall function, "CRchart", we use the same format as the "PRcurve" function but similarly substitute in "rec" for recall and "rpp" for rate of positive predictions.

R Code:
 1 > # lift chart
 2 > pred <- prediction(ROCR.simple$predictions, ROCR.simple$labels)
 3 > perf <- performance(pred, "lift", "rpp")
 4 > par(mfrow=c(1,2))
 5 > plot(perf, main="Lift Chart")
 6 > 
 7 > # cumulative chart with rate of positive predictions
 8 > CRchart <- function( preds, trues, ...) {
 9 +   require(ROCR, quietly=T)
10 +   pd <- prediction(preds, trues)
11 +   pf <- performance(pd, "rec", "rpp")
12 +   plot(pf, ...)
13 + }
14 > CRchart(ROCR.simple$predictions, ROCR.simple$labels,
15 +         main='Cumulative Recall Chart')
16 > 

This yields two graphs side by side:



Figure 3. Lift Chart and Cumulative Recall Chart

The closer the cumulative recall curve is to the top-left corner of the graph, the better the model. While lift charts compare recall with RPP, the cumulative recall versus RPP curve is more applicable here: it shows how recall changes with the inspection effort, that is, with the fraction of cases that are tested. At an RPP of 1.0, all cases were tested, and therefore all frauds were captured, giving a recall of 1.0.


Next we will be exploring normalized distance to obtain the outlier score with which we will determine the threshold and evaluate the many models' recall metrics. Stay tuned, folks!


Thanks for reading,

Wayne
@beyondvalence
LinkedIn

Fraudulent Transactions Series:
1. Predicting Fraudulent Transactions in R: Part 1. Transactions
2. Predicting Fraudulent Transactions in R: Part 2. Handling Missing Data
3. Predicting Fraudulent Transactions in R: Part 3. Handling Transaction Outliers
4. Predicting Fraudulent Transactions in R: Part 4. Model Criterion, Precision & Recall
5. Predicting Fraudulent Transactions in R: Part 5. Normalized Distance to Typical Price
.

Tuesday, January 27, 2015

Predicting Fraudulent Transactions in R: Part 3. Handling Transaction Outliers


Hello Readers,

Welcome back! In this post we continue our case study on Detecting Fraudulent Transactions. In Part 2, we cleaned and imputed the missing values in the 'Sales' dataset, but some wild 'Quantity' and 'Value' values are still present in the data. In this post we will address handling these pesky outliers, but not exactly in terms of large numbers. Think smaller.

Ready? Start R and let's go.

(This is a series from Luis Torgo's  Data Mining with R book.)

Those Pesky Outliers


When we first think of outliers, we might picture transactions with products being sold in enormous quantities, or products being sold at eyebrow-raising prices. $1,000 for each unit? Is that legit or a typo? But keeping in mind our goal of predicting the status of uninspected transactions from a training set of inspected ones, we need to be aware of the low number of manually inspected transactions ("ok" n=14,462, "fraud" n=1,270, vs "unkn" n=385,414) we will use to build a model for the un-inspected transactions (over 96% of the total). It turns out that there are 985 products with fewer than 20 transactions! Therefore, we should keep an eye on those products with few transactions, as well as on products with an average number of transactions but with crazy 'Quantity' or 'Value' counts.
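A quick way to verify that count, assuming the cleaned 'sales' data frame from Part 2 ('salesClean.rdata') is loaded:

> # products with fewer than 20 transactions
> sum(table(sales$Prod) < 20)
[1] 985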

Robust Measures

Load the 'salesClean.rdata' that we created in Part 2. Our plan of attack will use robust measures that are not sensitive to outliers, namely the median and inter-quartile range (IQR) of the unit price for each product, computed with the 'tapply()' function. Be sure to use only the transactions that are not labeled 'fraud', or else our typical values would be distorted. We then implement the 'tapply()' call to grab the median and quartile values from 'boxplot.stats()', and transform the list output from 'tapply()' into a matrix after we 'unlist()' it.

Robust Measures:
> load('salesClean.rdata')
> attach(sales)
> # find non-fraud indexes
> notF <- which(Insp != 'fraud')
> # median and IQR of unit-price for each product
> ms <- tapply(Uprice[notF], list(Prod=Prod[notF]), 
+              function(x) {
+                bp <- boxplot.stats(x)$stats
+                # returns median and iqr = 75% - 25% percentiles
+                c(median=bp[3],iqr=bp[4]-bp[2])
+              })
> # transforms into a matrix
> # with median unit price value for each product, along with iqr
> ms <- matrix(unlist(ms),
+              length(ms), 2, byrow=T, 
+              dimnames=list(names(ms), c('median', 'iqr')))
> head(ms)
      median      iqr
p1 11.346154 8.575599
p2 10.877863 5.609731
p3 10.000000 4.809092
p4  9.911243 5.998530
p5 10.957447 7.136601
p6 13.223684 6.685185
> 

Now we have a matrix of the median and IQR unit price for each product using non-fraud transactions. Note the printed median and IQR values for the first 6 products. Below, we plot the median and IQR unit prices twice, unscaled on the left, and in log scale on the right to accommodate the extreme values. Because the log plot on the right will have a more even distribution of values, we will plot them in grey, and later overlay them with unit price points of products with less than 20 transactions.


Plotting Measures:
> # plot the median and unit price for each product
> # also show log of plot to accomodate the extreme values
> par(mfrow= c(1,2))
> plot(ms[,1], ms[,2], xlab="Median", ylab="IQR", main="")
> # now grey log points
> plot(ms[,1], ms[,2], xlab="Median", ylab="IQR", main="",
+      col="grey", "log"="xy")
>
> # which are the few products less than 20 transactions
> smalls <- which(table(Prod) < 20)
> # draw the black +'s over the grey log points 
#>  to highlight products with few transactions
> points(log(ms[smalls,1]), log(ms[smalls,2]), pch="+")

Below, we see why only the log plot on the right needed to be grey. The extreme value at the top-right of the unscaled plot compresses the rest of the points into a dense cluster, so we would not have been able to see the overlay points anyway. With the log transform, however, the distribution of the robust unit-price measures is visualized much more clearly. The black points representing products with fewer than 20 transactions fall in line with the rest of the products.


Figure 1. Plots of Median and IQR Product Values in 2 Scales
We observe that many products have similar medians and IQR spreads, which is good for those products with fewer transactions. Even so, we need to examine the variables of those products with few transactions more closely.

Products with Few Transactions

We might assess outlier transactions from products with few transactions by grouping them together with products with similar distributions, to gain statistical significance. From the above plots, there are many products with similar medians and IQR spreads. That is why we earlier obtained the median and IQR values for each product, both resistant to extreme outlier values. We assume that the unit price of a product is approximately normally distributed around its median value, with a spread given by the IQR.


However, not all of the scarce products have distributions similar to other 'normal' products, so for those we will have more difficulty determining whether their transactions are fraudulent with statistical confidence.

Following the robust-measures theme, we proceed with a robust nonparametric test of the similarity of distributions, the Kolmogorov-Smirnov test. The K-S test evaluates the null hypothesis that two samples come from the same distribution; the statistic it returns is the maximum difference between the two empirical cumulative distribution functions.
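Before applying it to the sales data, here is a minimal, hypothetical illustration of 'ks.test()' on two simulated samples:

> set.seed(1)                     # hypothetical samples, not the sales data
> x <- rnorm(50); y <- rnorm(50)  # drawn from the same distribution
> kt <- ks.test(x, y)
> kt$statistic  # maximum distance between the two empirical CDFs
> kt$p.value    # a large p-value: cannot reject that the distributions are equal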


Running the K-S Test



Using the 'ms' matrix object with the median and IQR values we generated earlier, we scale 'ms' to 'dms', then create 'smalls' which contains the integer index of our scarce products. The 'prods' list contains the unit price for each transaction for each product. Then we finish preparation by generating an empty NA matrix with rows the length of 'smalls', and 7 columns for the resulting statistics.

Beginning in line 23 we start to loop through the scarce products. The first operation, on line 24, is crucial: we use 'scale()' to subtract the ith scarce product's scaled median and IQR values from every row of 'dms', so that each scarce product is compared to all the other products. Line 26 squares the differences (removing negative signs), sums them across the two columns via matrix multiplication, and takes the square root, giving a distance value for each product. In line 29 we run the K-S test with 'ks.test()', comparing the unit prices of the ith scarce product with those of the product having the second-smallest distance. Why the second smallest? Because the smallest distance is zero, the scarce product compared with itself; remember, the scaling in 'd' applies to all the products, including the current one. To record the results for the ith product, we store them in the ith row of the 'similar' matrix we created, which has one row per scarce product: the integer index of the most similar product in the first column, the K-S statistic and p-value next, and then the unscaled median and IQR values for the scarce product and its similar product.
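As a tiny illustration of the 'scale()' call on line 24 (with a hypothetical 2x2 matrix standing in for 'dms'):

> # subtract a center vector from every row, without rescaling
> m <- matrix(1:4, 2, 2)
> scale(m, center=c(1, 3), scale=FALSE)
     [,1] [,2]
[1,]    0    0
[2,]    1    1
attr(,"scaled:center")
[1] 1 3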

Finding Similar Products:
 1 > # K-S Test for similar transactions ####
 2 > # for products with few transactions
 3 > # create matrix table with similar product
 4 > # from the Kolmogorov-Smirnov test
 5 > # includes KS statistic, pvalue
 6 > # and median and iqr for both product and similar Product
 7 > 
 8 > dms <- scale(ms)
 9 > smalls <- which(table(Prod) < 20)
10 > prods <- tapply(sales$Uprice, sales$Prod, list)
11 > # create empty similar matrix, fit for few product #rows and 7 variables
12 > similar <- matrix(NA, length(smalls), 7,
13 +   # add row and column names
14 +   dimnames=list(names(smalls),
15 +   c("Similar", "ks.stat", "ks.p",
16 +   "medianP", "iqrP",
17 +   "medianSim", "iqrSim")))
18 > # iterate through each row in the median iqr Uprice matrix
19 > # with few transaction products only.
20 > # scale all rows to the current row to find the most similar product
21 > # using the KS test over all products.
22 > # does not iterate through all products in dms.
23 > for(i in seq(along=smalls)) {
24 +   d <- scale(dms, dms[smalls[i], ], FALSE)
25 +   # removes negatives through squaring and matrix multiplication
26 +   d <- sqrt(drop(d^2 %*% rep(1, ncol(d))))
27 +   # ks test of current product and next best product with lowest difference
28 +   # because best product is itself
29 +   stat <- ks.test(prods[[smalls[i]]], prods[[order(d)[2]]])
30 +   # add results to the similar matrix:
31 +   # similar product, KS statistic, KS pval,
32 +   # product values, similar product values (median, iqr)
33 +   similar[i, ] <- c(order(d)[2], stat$statistic, stat$p.value,
34 +   ms[smalls[i], ], ms[order(d)[2], ])
35 + }
36 > 
37 > head(similar) # so first product p8's similar product is p2829
38     Similar   ks.stat       ks.p  medianP      iqrP medianSim    iqrSim
39 p8     2827 0.4339623 0.06470603 3.850211 0.7282168  3.868306 0.7938557
40 p18     213 0.2568922 0.25815859 5.187266 8.0359968  5.274884 7.8894149
41 p38    1044 0.3650794 0.11308315 5.490758 6.4162095  5.651818 6.3248073
42 p39    1540 0.2258065 0.70914769 7.986486 1.6425959  8.080694 1.7668724
43 p40    3971 0.3333333 0.13892028 9.674797 1.6104511  9.668854 1.6520147
44 p47    1387 0.3125000 0.48540576 2.504092 2.5625835  2.413498 2.6402087
45 > # confirm using levels
46 > levels(Prod)[similar[1,1]]
47 [1] "p2829"

We see from the results in 'similar' that the most similar product to the first scarce product, 'p8', is the product at level index 2827; the quick check at line 46 confirms that this is product "p2829". The comparison is nearly statistically significant (p-value = 0.0647), which means that even though this is the most similar product, we come close to rejecting the null hypothesis of the K-S test that the two samples come from the same distribution. We want the similar products to come from distributions close to those of the scarce products, so we can group them together and overcome the small number of transactions. So we are looking for p-values at the other extreme, close to 1, which indicate that the two distributions are (nearly) equal.

Evaluation of Valid Products


Now that we have the 'similar' matrix, we can examine how many scarce products have a match that is sufficiently similar in distribution at the 90% level. For that, we require a K-S p-value of at least 0.9, because we want to be as close to the null of equal distributions as possible. So below, we take the sum of the row-wise logical comparison.

Viable Products:

> # check how many products have unit price distribution that is
> # significantly similar with 90% CI with KS value being 0.9 or greater
> sum(similar[, "ks.p"] >= 0.9) # 117
[1] 117
> dim(similar)
[1] 985   7
> save(similar, file="similarProducts.Rdata")
> 

Observe that 117 products have sufficiently similar products, out of the 985 products with fewer than 20 transactions. So while we can match almost 12% of the scarce products with another product, there are still 868 products with fewer than 20 transactions and no match. However, do not despair! Though we have quite a few we were not able to match, we did capture 117 products for which we would otherwise have had trouble making accurate fraud predictions later. (Remember to save the 'similar' object!)
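That proportion is easy to check:

> 117/985 # fraction of scarce products with a sufficiently similar match
[1] 0.1187817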


Summary

Here we explored a non-traditional kind of outlier, products with a small number of transactions, for the future purpose of more accurate fraud prediction. We used the Kolmogorov-Smirnov test to find, for each scarce product, the product with the most similar unit-price distribution, based on the robust measures of median and IQR. The results show that 117 of the 985 scarce products have a sufficiently similar product at the 90% level.

Next we will discuss the evaluation criteria of classifying fraudulent transactions from a predictive model. How do we know how well the model performed? How does one model compare to another? Stay tuned!

Thanks for reading,

Wayne
@beyondvalence
LinkedIn

Fraudulent Transactions Series:
1. Predicting Fraudulent Transactions in R: Part 1. Transactions
2. Predicting Fraudulent Transactions in R: Part 2. Handling Missing Data
3. Predicting Fraudulent Transactions in R: Part 3. Handling Transaction Outliers
4. Predicting Fraudulent Transactions in R: Part 4. Model Criterion, Precision & Recall
5. Predicting Fraudulent Transactions in R: Part 5. Normalized Distance to Typical Price
.

Sunday, July 27, 2014

Predicting Fraudulent Transactions in R: Part 2. Handling Missing Data


Hello Readers,

"We have missing data." How many times have you heard that sentence and cringed inside? If you worked with user-generated data before, you most likely have happened upon the "NULL" or "NA" or "99999" values, which possibly diminished your usable data. It gets me every time, too. After a while, you get used to those encounters, because as data scientists, we have the tools and methods to overcome incomplete data, and still carry out the analysis. In this post, I discuss missing data values in our Case Study for Fraudulent Transactions.


One Possible 'Option' When Encountering Missing Data

This is Part 2 of the Case Study, and here we handle the missing data. Let's begin where we left off in Part 1.


(This is a series from Luis Torgo's  Data Mining with R book.)


Dual Unknown Values


When we took the "summary()" of our SALES data, we found that our "Quant" and "Val" variables had NA values. For each transaction, the quantity of the product sold and the value of the sale are the only predictors (in this case) of whether a transaction is fraudulent or "ok". So when 888 transactions are missing both variables, we find it difficult to impute any value, given the number of unknowns. If either "Quant" or "Val" were available, we might have been able to use the unit price of the product (value/quantity) to calculate the missing variable.
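A quick count confirms the scale of the problem (assuming the 'sales' data frame from Part 1 is loaded):

> # transactions missing both the quantity and the value
> sum(is.na(sales$Quant) & is.na(sales$Val))
[1] 888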

Salespeople and Product NAs Code:

> attach(sales)
> 
> totS <- table(ID)
> totP <- table(Prod)
> 
> # transactions per values and product for NAs
> nas <- sales[which(is.na(Quant) & is.na(Val)), c("ID", "Prod")]
> 
> # obtain salesppl NAs
> propS <- 100* table(nas$ID)/totS
> propS[order(propS, decreasing=T)[1:10]]

    v1237     v4254     v4038     v5248     v3666     v4433     v4170 
13.793103  9.523810  8.333333  8.333333  6.666667  6.250000  5.555556 
    v4926     v4664     v4642 
 5.555556  5.494505  4.761905 
> #
> # the transactions represent a small proportion of transactions
> # looking at the products:
> propP <- 100* table(nas$Prod)/totP
> propP[order(propP, decreasing=T)[1:10]]

   p2689    p2675    p4061    p2780    p4351    p2686    p2707    p2690 
39.28571 35.41667 25.00000 22.72727 18.18182 16.66667 14.28571 14.08451 
   p2691    p2670 
12.90323 12.76596 
> 
> # several products more than 20% of their transactions would be
> # removed: that is too much, p2689 would have 40% removed
> detach(sales)
> sales <- sales[-which(is.na(sales$Quant) & is.na(sales$Val)),]

Examining the salespeople first, we create a table of the number of transactions by each salesperson. Then we subset the SALES data, pulling the transactions with NAs in both "Quant" and "Val", along with the "ID" and "Prod" columns (salesperson ID and product ID). Dividing the table of NA-containing transactions by the full table of all transactions gives the proportion of NAs in each salesperson's transactions. Taking the top 10 salespeople with NAs in their transactions using "order()", we discover that salesperson "v1237" had the highest percentage of transactions with dual NAs, at 13.8%. That percentage is not too high, and all the other salespeople have lower percentages. We can breathe easy, since no single salesperson had the majority of his or her transaction reports filled with NAs. So if we remove transactions with dual NAs, no single salesperson will be overly affected.


Investigating the products next, we follow the same procedure and create tables for all of the products, for products with NAs, and, by division, for the percentage of NAs per product. Looking at the 10 products with the most missing values, product "p2689" has nearly 40% of its transactions incomplete. Unlike the NAs grouped by salespeople, if we delete the dual-NA transactions, products "p2689" and "p2675" will have over 35% of their transactions removed, and 4 products would have at least 20% removed! Clearly some products have more missing values than others.



Alternatives


There are generally three alternatives when we encounter missing data. The first is to remove those rows. The second is to fill in the missing values using some calculated method, and the third is to use tools that handle those values directly. If you work with an industry-specific program or have specific handling instructions, the third option would be the best decision. But here, we can only choose from the first two options.

We could impute unit prices for the missing products, but with product "p2689", we would only have 60% of the data to fill in the other 40%. If too many of a product's transactions are removed, we would then join its transactions with those of similar products for the outlier detection tests. The best option here is to remove those transactions. So "detach()" the SALES data and remove the transactions with both quantity and value missing, via sub-setting the SALES data.


Single Unknown Values


Now that we have resolved the unknown values occurring in both the quantity and value variables, we refocus on transactions with one unknown in either variable. There were 888 transactions with both missing; with a single value missing, there are 13,248 transactions, nearly 15 times as many. Let us first start with the quantity variable, "Quant".

We stratify the quantities by the product code, thus searching for the proportion of NAs present in the quantities of a certain product. We flag a product if a high number of its transactions have missing quantities.

Missing Quantity for Products Code:

> # move to unknown value either quantity or value ####
> # how many transactions?
> #
> length(which(is.na(sales$Quant) | is.na(sales$Val)))
[1] 13248
> #
> # check missing quantity first
> nnasQp <- tapply(sales$Quant, list(sales$Prod),
+                  function(x) sum(is.na(x)))
> propNAsQp <- nnasQp/table(sales$Prod)
> propNAsQp[order(propNAsQp, decreasing=T)[1:10]]
    p2442     p2443     p1653     p4101     p4243      p903     p3678 
1.0000000 1.0000000 0.9090909 0.8571429 0.6842105 0.6666667 0.6666667 
    p3955     p4464     p1261 
0.6428571 0.6363636 0.6333333 
> # there are two products with all transactions of unknown
> # quantity, p2442, p2443
> sales[sales$Prod %in% c("p2442", "p2443"),]
          ID  Prod Quant    Val  Insp
21259  v2921 p2442    NA   5715  unkn
21260  v2922 p2442    NA  21250  unkn
21261  v2923 p2442    NA 102210  unkn
21262  v2922 p2442    NA 213155  unkn
21263  v2924 p2442    NA   4870  unkn
21264  v2925 p2443    NA  17155    ok
58422  v2924 p2442    NA   4870  unkn
58423  v2922 p2442    NA  21250  unkn
58424  v4356 p2442    NA  53815  unkn
58425  v2922 p2442    NA 213155  unkn
58426  v2925 p2443    NA   3610 fraud
58427  v4355 p2443    NA   5260  unkn
58428  v4355 p2443    NA   1280 fraud
102076 v2921 p2442    NA 137625  unkn
102077 v2920 p2442    NA  21310  unkn
102078 v4839 p2442    NA   5190  unkn
102079 v4356 p2442    NA  11320  unkn
102080 v2922 p2442    NA  34180  unkn
102081 v2925 p2443    NA   3610  unkn
102082 v4355 p2443    NA   5260  unkn
102083 v4355 p2443    NA   1280  unkn
102084 v2925 p2443    NA   3075  unkn
153543 v5077 p2442    NA   7720  unkn
153544 v2924 p2442    NA   9620  unkn
153545 v2920 p2442    NA  34365  unkn
153546 v2925 p2443    NA   3455  unkn
195784 v5077 p2442    NA   7720  unkn
195785 v4356 p2442    NA  43705  unkn
195786 v2939 p2443    NA   5465  unkn
195787 v2925 p2443    NA  14990  unkn
252153 v2924 p2442    NA   4870  unkn
252154 v2921 p2442    NA 137625  unkn
252155 v5077 p2442    NA   7720  unkn
252156 v2922 p2442    NA  66820  unkn
252157 v5077 p2442    NA  12035  unkn
252158 v2920 p2442    NA  79320  unkn
252159 v2925 p2443    NA   3610  unkn
325280 v2924 p2442    NA   4870  unkn
325281 v2921 p2442    NA 137625  unkn
325282 v5077 p2442    NA   7720  unkn
325283 v2922 p2442    NA  66820  unkn
325284 v5077 p2442    NA  12350  unkn
325285 v5077 p2442    NA  12035  unkn
325286 v2920 p2442    NA  43180  unkn
325289 v2925 p2443    NA   3610  unkn
325290 v4355 p2443    NA   5260  unkn
325291 v4355 p2443    NA   1280  unkn
325292 v2925 p2443    NA   2890  unkn
390840 v5077 p2442    NA  11515  unkn
390841 v4356 p2442    NA   4695  unkn
390842 v2923 p2442    NA  15580  unkn
390843 v2920 p2442    NA  27320  unkn
390844 v6044 p2442    NA  21215  unkn
390845 v4356 p2442    NA  53190  unkn
> # delete them because both have OK and Fraud
> sales <- sales[!sales$Prod %in% c("p2442", "p2443"),]
> # update levels
> #
> nlevels(sales$Prod) # 4548
[1] 4548
> sales$Prod <- factor(sales$Prod)
> nlevels(sales$Prod) # 4546
[1] 4546
> # now has correct number, after we removed the 2 products

Looking at the proportions, products "p2442" and "p2443" lack the quantity metric in all of their transactions! Also, "p1653" has 90% missing, and "p4101" has 86% missing. These are quite stark numbers. For the two products that lack all of their quantity values, looking at their fraud inspection status, we see some transactions labeled "ok" and some labeled "fraud". Because it is not statistically sound to rely on those evaluations given the lack of data, we will remove those two problem products. Additionally, remember to update the levels of the "Prod" factor, since we removed "p2442" and "p2443".


Missing Quantity by Salespeople Code:

> # check salesppl with transactions of unknown quantity
> nnasQs <- tapply(sales$Quant, list(sales$ID),
+                  function(x) sum(is.na(x)))
> propNAsQs <- nnasQs/table(sales$ID)
> propNAsQs[order(propNAsQs, decreasing=T)[1:10]]
    v2925     v5537     v5836     v6058     v6065     v4368     v2923 
1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 0.8888889 0.8750000 
    v2970     v4910     v4542 
0.8571429 0.8333333 0.8095238 
> # quite a few salesppl did not fill out the quantity info
> # but we can use numbers from other sales ppl under unitprice
> # so no need to delete

Here we again take the table of the number of missing quantities by salesperson and find the proportion by dividing by the table of each salesperson's transaction counts. Then, taking the top 10 results, we see that salespeople "v2925", "v5537", "v5836", "v6058", and "v6065" have all of their transactions missing the quantity variable. However, as long as we have transactions of the same products by other salespeople, we can use the unit price to calculate the quantity from the value element that is present.


Missing Value by Product Code:

> # check unknown values
> # by product
> nnasVp <- tapply(sales$Val, list(sales$Prod),
+                  function(x) sum(is.na(x)))
> propNAsVp <- nnasVp/table(sales$Prod)
> propNAsVp[order(propNAsVp, decreasing=T)[1:10]]
     p1110      p1022      p4491      p1462        p80      p4307 
0.25000000 0.17647059 0.10000000 0.07500000 0.06250000 0.05882353 
     p4471      p2821      p1017      p4287 
0.05882353 0.05389222 0.05263158 0.05263158 
> # reasonable results, no need to delete

Using a similar pattern to find the proportion of missing "Val" values for each product, we pleasantly discover no high percentages. Since no product has a high proportion of NAs, we do not need to delete anything here.


Missing Value by Salespeople Code:

> # unknown values by salespeople
> #
> nnasVs <- tapply(sales$Val, list(sales$ID),
+ function(x) sum(is.na(x)))
> propNAsVs <- nnasVs/table(sales$ID)
> propNAsVs[order(propNAsVs, decreasing=T)[1:10]]
     v5647        v74      v5946      v5290      v4472      v4022 
0.37500000 0.22222222 0.20000000 0.15384615 0.12500000 0.09756098 
      v975      v2814      v2892      v3739 
0.09574468 0.09090909 0.09090909 0.08333333 
> # reasonable results again

Now examining the missing values in "Val" by salespeople, we observe no salesperson with overly high proportions of NAs involving "Val". We have acceptable results, with no need to delete any more transactions.



Imputing Missing Data


Now that we have removed the transactions with insufficient information, we can fill in the remaining values using our fill-in strategy of relying on the unit-price. Also, we need to skip those transactions previously audited and labeled as "fraud". We utilize the median unit price of transactions as the typical price for their respective products. (I saved the SALES data because we took out invalid transactions, you might want to do the same.) To find the median unit-price without consulting those fraudulent transactions, we specify those transactions with the "Insp" (inspection) variable as not equal to fraud, '!= "fraud" '.

Filling In 'Quant' and 'Val' Code:
> load("sales.rdata") ####
> # imputing ####
> #
> # calculate median price, while ignoring frauds
> #
> tPrice <- tapply(sales[sales$Insp != "fraud", "Uprice"],
+                  list(sales[sales$Insp != "fraud", "Prod"]),
+                  median, na.rm=T)
> # now we can use median unit-price to calculate Quant and Val
> #
> # we already eliminated transactions with both missing
> #
> # begin with Quant, find missing Quants
> noQuant <- which(is.na(sales$Quant))
> sum(is.na(sales$Quant))
[1] 12900
> # Imputing Quant
> # round Quant up
> sales[noQuant,'Quant'] <- ceiling(sales[noQuant, 'Val']/
+                           tPrice[sales[noQuant,"Prod"]])
> #
> # next Val, find missing Vals
> noVal <- which(is.na(sales$Val))
> sum(is.na(sales$Val))
[1] 294
> # impute Vals
> sales[noVal, 'Val'] <- sales[noVal,'Quant']*
+                        tPrice[sales[noVal, 'Prod']]

We fill in the missing 'Quant' values by creating an index of the missing entries, and discover 12,900 transactions ready to be imputed. Then we round all the imputed 'Quant' values up, since quantity is an integer. Remember, value/quantity = unit-price, so we divide the value by the typical unit price to obtain the quantity.
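For instance, with hypothetical numbers, a transaction with Val = 102 whose product has a typical unit price of 4 would be imputed as follows:

> ceiling(102/4) # hypothetical Val = 102, typical unit price = 4
[1] 26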


Next we tackle the missing 'Val' values, create the missing 'Val' index, and we find 294 we can fill in. Again, using the simple formula, we multiply the quantity by the unit-price to obtain the value.



Clean Up


Now that we have no more unknown values in 'Quant' or 'Val', we have a complete, or clean, dataset. But we are not finished! Since we have all the quantities and values, we can recalculate the unit price with every value present. And make sure to save this SALES dataset, if you have not already. Naming the file 'salesClean.rdata' allows us to differentiate the regular SALES set from the clean SALES set.

Recalculating Unit-Price Code:
> # all unit-prices present, so recalculate
> #
> sales$Uprice <- sales$Val/sales$Quant
> # now we have a dataset free of unknowns
> # save our progress!
> save(sales, file='salesClean.rdata')



Summary



Whew! Handling missing data is a necessary task, and deciding when to remove rows requires data-wrangling experience and knowledge of the other data present. OK folks, so here we located and quantified the unknown values in the variables, identified which rows (transactions) to keep and which to remove, and imputed the rest using the median unit price of the same products. Sometimes in user-generated data, a variable cannot be derived from other variables, so imputation is impossible. Other times, we can infer values based on other variables. Either way, working with 'clean' data is not a given, since we usually have to work to refine it!