Pages

Sunday, July 27, 2014

Predicting Fraudulent Transactions in R: Part 2. Handling Missing Data


Hello Readers,

"We have missing data." How many times have you heard that sentence and cringed inside? If you worked with user-generated data before, you most likely have happened upon the "NULL" or "NA" or "99999" values, which possibly diminished your usable data. It gets me every time, too. After a while, you get used to those encounters, because as data scientists, we have the tools and methods to overcome incomplete data, and still carry out the analysis. In this post, I discuss missing data values in our Case Study for Fraudulent Transactions.


One Possible 'Option' When Encountering Missing Data

This is Part 2 of the Case Study, and here we handle the missing data. Let's begin where we left off in Part 1.


(This is a series from Luis Torgo's  Data Mining with R book.)


Dual Unknown Values


When we took the "summary()" of our SALES data, we found our "Quant" and "Val" variables had NA values. For each transaction, the quantity of the product sold and the value of the sale are important predictors (only, in this case) of being a fraudulent transaction or deemed "ok". So when 888 transactions are missing both variables, we find it difficult to impute any value due to the number of unknowns. If we had a "Quant" or "Val" variable available, we might have been able to use the unit-price of the product (value/quantity) to calculate the missing variable.

Salespeople and Product NAs Code:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
> attach(sales)
> 
> totS <- table(ID)
> totP <- table(Prod)
> 
> # transactions per values and product for NAs
> nas <- sales[which(is.na(Quant) & is.na(Val)), c("ID", "Prod")]
> 
> # obtain salesppl NAs
> propS <- 100* table(nas$ID)/totS
> propS[order(propS, decreasing=T)[1:10]]

    v1237     v4254     v4038     v5248     v3666     v4433     v4170 
13.793103  9.523810  8.333333  8.333333  6.666667  6.250000  5.555556 
    v4926     v4664     v4642 
 5.555556  5.494505  4.761905 
> #
> # the transactions represent a small proportion of transactions
> # looking at the products:
> propP <- 100* table(nas$Prod)/totP
> propP[order(propP, decreasing=T)[1:10]]

   p2689    p2675    p4061    p2780    p4351    p2686    p2707    p2690 
39.28571 35.41667 25.00000 22.72727 18.18182 16.66667 14.28571 14.08451 
   p2691    p2670 
12.90323 12.76596 
> 
> # several products more than 20% of their transactions would be
> # removed: that is too much, p2689 would have 40% removed
> detach(sales)
> sales <- sales[-which(is.na(sales$Quant) & is.na(sales$Val)),]

Examining the salespeople first, we create a table for the number of transactions by each salesperson. Then we subset the SALES data, pulling the transactions with NAs in "Quant" and "Val", with the information on the "ID" and "Prod" (salesperson ID and product ID). Dividing this table of NA present transactions by the full table of all the transactions, measures the proportion of NAs in each salesperson's transactions. Taking the top 10 salespeople who NAs in their transactions with "order()", we discover that salesperson "v1237" had the highest percentage of transactions with dual NAs, at 13.8%. That percentage is not too high, and all the other salespeople have lower percentages. We can breathe easy, since not one single salesperson had the majority of his or her transaction reports filled with NAs. So if we remove transactions with dual NAs, not one single salesperson will be overly affected.


Investigating the products next, we do the same procedure and create tables for all of the products, for products with NAs, and the percentage of products with NAs with division. Looking at the 10 products with missing values, product "p2689" has nearly 40 of its transactions incomplete. Unlike the NAs grouped by salespeople, if we delete the dual NA transactions, product "p2689", and "p2675" will have over 35% of their transactions removed, and 4 products would have at least 20% removed! Clearly some products have more missing values than others.



Alternatives


There are generally 3 alternatives available to us as options when we encounter missing data. The first is to remove those rows. Second, fill in the missing values using a calculated method, or third, use tools to handle those values. If you work with an industry specific program or have specific handling instructions, then option three would be the best decision. But here, we can only choose from the first two options.

We could impute unit-prices for the missing products, but with product "p2689", we would only have 60% of the data to fill in the 40%. If there are too many transactions removed, we would then join those transactions with ones from similar products for outlier detection tests. The most best option would be to remove those transactions. So "detach()" the SALES data and remove those transactions with both missing quantity and value elements, via sub-setting the SALES data.


Single Unknown Values


Now that we have resolved the unknown values in both quantity and value variables, we refocus onto transactions with one unknown in either variable. There were 888 transactions with both missing, and for the single missing value, there are 13,248 transactions, nearly 15 times as many. Let us first start with the quantity variable, "Quant".

We stratify the quantities by the product code, thus searching for the proportion of NAs present in the quantities of a certain product. We flag a product if a high number of its transactions have missing quantities.

Missing Quantity for Products Code:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
> # move to unknown value either quantity or value ####
> # how many transactions?
> #
> length(which(is.na(sales$Quant) | is.na(sales$Val)))
[1] 13248
> #
> # check missing quantity first
> nnasQp <- tapply(sales$Quant, list(sales$Prod),
+                  function(x) sum(is.na(x)))
> propNAsQp <- nnasQp/table(sales$Prod)
> propNAsQp[order(propNAsQp, decreasing=T)[1:10]]
    p2442     p2443     p1653     p4101     p4243      p903     p3678 
1.0000000 1.0000000 0.9090909 0.8571429 0.6842105 0.6666667 0.6666667 
    p3955     p4464     p1261 
0.6428571 0.6363636 0.6333333 
> # there are two products with all transactions of unknown
> # quantity, p2442, p2443
> sales[sales$Prod %in% c("p2442", "p2443"),]
          ID  Prod Quant    Val  Insp
21259  v2921 p2442    NA   5715  unkn
21260  v2922 p2442    NA  21250  unkn
21261  v2923 p2442    NA 102210  unkn
21262  v2922 p2442    NA 213155  unkn
21263  v2924 p2442    NA   4870  unkn
21264  v2925 p2443    NA  17155    ok
58422  v2924 p2442    NA   4870  unkn
58423  v2922 p2442    NA  21250  unkn
58424  v4356 p2442    NA  53815  unkn
58425  v2922 p2442    NA 213155  unkn
58426  v2925 p2443    NA   3610 fraud
58427  v4355 p2443    NA   5260  unkn
58428  v4355 p2443    NA   1280 fraud
102076 v2921 p2442    NA 137625  unkn
102077 v2920 p2442    NA  21310  unkn
102078 v4839 p2442    NA   5190  unkn
102079 v4356 p2442    NA  11320  unkn
102080 v2922 p2442    NA  34180  unkn
102081 v2925 p2443    NA   3610  unkn
102082 v4355 p2443    NA   5260  unkn
102083 v4355 p2443    NA   1280  unkn
102084 v2925 p2443    NA   3075  unkn
153543 v5077 p2442    NA   7720  unkn
153544 v2924 p2442    NA   9620  unkn
153545 v2920 p2442    NA  34365  unkn
153546 v2925 p2443    NA   3455  unkn
195784 v5077 p2442    NA   7720  unkn
195785 v4356 p2442    NA  43705  unkn
195786 v2939 p2443    NA   5465  unkn
195787 v2925 p2443    NA  14990  unkn
252153 v2924 p2442    NA   4870  unkn
252154 v2921 p2442    NA 137625  unkn
252155 v5077 p2442    NA   7720  unkn
252156 v2922 p2442    NA  66820  unkn
252157 v5077 p2442    NA  12035  unkn
252158 v2920 p2442    NA  79320  unkn
252159 v2925 p2443    NA   3610  unkn
325280 v2924 p2442    NA   4870  unkn
325281 v2921 p2442    NA 137625  unkn
325282 v5077 p2442    NA   7720  unkn
325283 v2922 p2442    NA  66820  unkn
325284 v5077 p2442    NA  12350  unkn
325285 v5077 p2442    NA  12035  unkn
325286 v2920 p2442    NA  43180  unkn
325289 v2925 p2443    NA   3610  unkn
325290 v4355 p2443    NA   5260  unkn
325291 v4355 p2443    NA   1280  unkn
325292 v2925 p2443    NA   2890  unkn
390840 v5077 p2442    NA  11515  unkn
390841 v4356 p2442    NA   4695  unkn
390842 v2923 p2442    NA  15580  unkn
390843 v2920 p2442    NA  27320  unkn
390844 v6044 p2442    NA  21215  unkn
390845 v4356 p2442    NA  53190  unkn
> # delete them because both have OK and Fraud
> sales <- sales[!sales$Prod %in% c("p2442", "p2443"),]
> # update levels
> #
> nlevels(sales$Prod) # 4548
[1] 4548
> sales$Prod <- factor(sales$Prod)
> nlevels(sales$Prod) # 4846
[1] 4546
> # now has correct number, after we removed the 2 products

Looking at the proportions, product "p2442" and "p2443" lack the quantity metric all of their transactions! Also, "p1653" has 90% missing, and "p4101" has 86% missing. These are quite stark numbers. For those products that lack all of their quantity values, looking at their fraud inspection status, we see some labeled "ok" and some labeled "fraud". Because it is not statistically sound to use those evaluations given the lack of data, we will remove those two problem products. Additionally, remember to update the levels in the "Prod" variable, since we removed "p2442" and "p2443".


Missing Quantity by Salespeople Code:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
> # check salesppl with transactions of unknown quantity
> nnasQs <- tapply(sales$Quant, list(sales$ID),
+                  function(x) sum(is.na(x)))
> propNAsQs <- nnasQs/table(sales$ID)
> propNAsQs[order(propNAsQs, decreasing=T)[1:10]]
    v2925     v5537     v5836     v6058     v6065     v4368     v2923 
1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 0.8888889 0.8750000 
    v2970     v4910     v4542 
0.8571429 0.8333333 0.8095238 
> # quite a few salesppl did not fill out the quantity info
> # but we can use numbers from other sales ppl under unitprice
> # so no need to delete

Here we again take the table of number of missing quantities by salespeople and find the proportion by dividing with the table of salespeople. Then taking the top 10 results, we see that salesperson "v2925", "v5537", "v5836", "v6058", and "v6065", have all of their transactions missing the quantity variable. However, as long as we have transactions of the same product by other people, we can use the unit-price to calculate the quantity from the present value element.


Missing Value by Product Code:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
> # check unknown values
> # by product
> nnasVp <- tapply(sales$Val, list(sales$Prod),
+                  function(x) sum(is.na(x)))
> propNAsVp <- nnasVp/table(sales$Prod)
> propNAsVp[order(propNAsVp, decreasing=T)[1:10]]
     p1110      p1022      p4491      p1462        p80      p4307 
0.25000000 0.17647059 0.10000000 0.07500000 0.06250000 0.05882353 
     p4471      p2821      p1017      p4287 
0.05882353 0.05389222 0.05263158 0.05263158 
> # reasonable results, no need to delete

Using the similar pattern of finding the proportion of missing "Val" values for each product, we pleasantly discover no high percentages. Since no product has high NAs proportions, we do not need to delete them.


Missing Value by Salespeople Code:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
> # unknown values by salespeople
> #
> nnasVs <- tapply(sales$Val, list(sales$ID),
+ function(x) sum(is.na(x)))
> propNAsVs <- nnasVs/table(sales$ID)
> propNAsVs[order(propNAsVs, decreasing=T)[1:10]]
     v5647        v74      v5946      v5290      v4472      v4022 
0.37500000 0.22222222 0.20000000 0.15384615 0.12500000 0.09756098 
      v975      v2814      v2892      v3739 
0.09574468 0.09090909 0.09090909 0.08333333 
> # reasonable results again

Now examining the missing values in "Val" by salespeople, we observe no salesperson with overly high proportions of NAs involving "Val". We have acceptable results, with no need to delete any more transactions.



Imputing Missing Data


Now that we have removed the transactions with insufficient information, we can fill in the remaining values using our fill-in strategy of relying on the unit-price. Also, we need to skip those transactions previously audited and labeled as "fraud". We utilize the median unit price of transactions as the typical price for their respective products. (I saved the SALES data because we took out invalid transactions, you might want to do the same.) To find the median unit-price without consulting those fraudulent transactions, we specify those transactions with the "Insp" (inspection) variable as not equal to fraud, '!= "fraud" '.

Filling In 'Quant' and 'Val' Code:
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
> load("sales.rdata") ####
> # imputing ####
> #
> # calculate median price, while ignoring frauds
> #
> tPrice <- tapply(sales[sales$Insp != "fraud", "Uprice"],
+                  list(sales[sales$Insp != "fraud", "Prod"]),
+                  median, na.rm=T)
> # now we can use median unit-price to calculate Quant and Val
> #
> # we already eliminated transactions with both missing
> #
> # begin with Quant, find missing Quants
> noQuant <- which(is.na(sales$Quant))
> sum(is.na(sales$Quant))
[1] 12900
> # Imputing Quant
> # round Quant up
> sales[noQuant,'Quant'] <- ceiling(sales[noQuant, 'Val']/
+                           tPrice[sales[noQuant,"Prod"]])
> #
> # next Val, find missing Vals
> noVal <- which(is.na(sales$Val))
> sum(is.na(sales$Val))
[1] 294
> # impute Vals
> sales[noVal, 'Val'] <- sales[noVal,'Quant']*
+                        tPrice[sales[noVal, 'Prod']]

We fill in the missing 'Quant' values by creating a missing index, and discover 12,900 transactions ready to be imputed. Then we round all the 'Quant' values we will impute up, since quantity is an integer. Remember, value/qantity = unit-price, so we divide value by the unit-price to obtain the quantity values.


Next we tackle the missing 'Val' values, create the missing 'Val' index, and we find 294 we can fill in. Again, using the simple formula, we multiply the quantity by the unit-price to obtain the value.



Clean Up


Now that we have no more unknown values in 'Quant' or 'Val', we have a complete, or clean dataset. But we are not finished! Since we have all the quantity and values, we can recalculate the unit-price with all the values present. And make sure to save this SALES dataset, in case you have not yet already. Naming the file 'salesClean.rdata' allows us to differentiate the regular SALES set with the clean SALES set.

Recalculating Unit-Price Code:
1
2
3
4
5
6
> # all unit-prices present, so recalculate
> #
> sales$Uprice <- sales$Val/sales$Quant
> # now we have a dataset free of unknowns
> # save our progress!
> save(sales, file='salesClean.rdata')



Summary



Whew! Handling missing data is a necessary task, and when to remove rows entries requires data wrangling experience and reliance/knowledge of the other data present. OK folks, so here we located and quantified the unknown values in the variables, identified which rows (transactions) we needed and which to remove, and removed those while imputing values from the same products using the median unit-price. Sometimes in user-generated data, a variable might not be composed of other variables, so imputation is impossible. Other times, we can infer values based on other variables. Either way, working with 'clean' data is not a privilege, since we usually have to work to refine it!

No comments:

Post a Comment