Hello Readers,
"We have missing data." How many times have you heard that sentence and cringed inside? If you worked with user-generated data before, you most likely have happened upon the "NULL" or "NA" or "99999" values, which possibly diminished your usable data. It gets me every time, too. After a while, you get used to those encounters, because as data scientists, we have the tools and methods to overcome incomplete data, and still carry out the analysis. In this post, I discuss missing data values in our Case Study for Fraudulent Transactions.
One Possible 'Option' When Encountering Missing Data
This is Part 2 of the Case Study, and here we handle the missing data. Let's begin where we left off in Part 1.
(This is a series from Luis Torgo's Data Mining with R book.)
Dual Unknown Values
When we took the "summary()" of our SALES data, we found that our "Quant" and "Val" variables had NA values. For each transaction, the quantity of the product sold and the value of the sale are the only predictors (in this case) of the transaction being fraudulent or "ok". So when 888 transactions are missing both variables, we find it difficult to impute any value due to the number of unknowns. If either "Quant" or "Val" were available, we could have used the unit-price of the product (value/quantity) to calculate the missing variable.
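To see why one present field is enough, here is a minimal sketch (my own illustration with made-up numbers, not from the book) of recovering a missing field from a typical unit-price:

> # hypothetical product with a typical unit-price of 2.5
> unitPrice <- 2.5
> # a transaction with a known Val but a missing Quant:
> val <- 125
> val / unitPrice     # recovers the quantity
[1] 50
> # conversely, a known Quant recovers a missing Val:
> 50 * unitPrice
[1] 125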
Salespeople and Product NAs Code:
> attach(sales)
>
> totS <- table(ID)
> totP <- table(Prod)
>
> # transactions per salesperson and per product with dual NAs
> nas <- sales[which(is.na(Quant) & is.na(Val)), c("ID", "Prod")]
>
> # proportion of NA transactions per salesperson
> propS <- 100 * table(nas$ID)/totS
> propS[order(propS, decreasing=T)[1:10]]
    v1237     v4254     v4038     v5248     v3666     v4433     v4170
13.793103  9.523810  8.333333  8.333333  6.666667  6.250000  5.555556
    v4926     v4664     v4642
 5.555556  5.494505  4.761905
>
> # these transactions are a small proportion of each salesperson's
> # totals; now looking at the products:
> propP <- 100 * table(nas$Prod)/totP
> propP[order(propP, decreasing=T)[1:10]]
   p2689    p2675    p4061    p2780    p4351    p2686    p2707    p2690
39.28571 35.41667 25.00000 22.72727 18.18182 16.66667 14.28571 14.08451
   p2691    p2670
12.90323 12.76596
>
> # several products would have more than 20% of their transactions
> # removed (p2689 would lose almost 40%), but removal is still the
> # best option here
> detach(sales)
> sales <- sales[-which(is.na(sales$Quant) & is.na(sales$Val)),]
Examining the salespeople first, we create a table of the number of transactions by each salesperson. Then we subset the SALES data, pulling the transactions with NAs in both "Quant" and "Val", along with their "ID" and "Prod" (salesperson ID and product ID). Dividing this table of NA transactions by the full table of all transactions gives the proportion of NAs in each salesperson's transactions. Taking the top 10 salespeople with NAs in their transactions using "order()", we discover that salesperson "v1237" had the highest percentage of dual-NA transactions, at 13.8%. That percentage is not too high, and all the other salespeople have lower percentages. We can breathe easy, since no single salesperson has the majority of his or her transaction reports filled with NAs. So if we remove the transactions with dual NAs, no single salesperson will be overly affected.
Investigating the products next, we follow the same procedure: create tables for all of the products and for products with NAs, then divide to get the percentage of each product's transactions with NAs. Looking at the 10 products with the most missing values, product "p2689" has nearly 40% of its transactions incomplete. Unlike the NAs grouped by salespeople, if we delete the dual-NA transactions, products "p2689" and "p2675" will have over 35% of their transactions removed, and 4 products would have at least 20% removed! Clearly some products have more missing values than others.
Alternatives
We generally have 3 alternatives when we encounter missing data. The first is to remove those rows; the second is to fill in the missing values using some calculated method; and the third is to use tools and methods that can handle missing values directly. If you work with an industry-specific program or have specific handling instructions, then option three would be the best decision. But here, we can only choose from the first two options.
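As a rough sketch of what the three options look like in R (my own illustration; the object names are hypothetical, and we will use per-product variants of the first two rather than these naive versions):

> # option 1: remove the rows with missing values
> salesComplete <- sales[complete.cases(sales[, c("Quant", "Val")]), ]
> # option 2: fill in missing values with a calculated method
> # (a naive global median here; we will use per-product medians later)
> salesFilled <- sales
> salesFilled$Val[is.na(salesFilled$Val)] <- median(sales$Val, na.rm=TRUE)
> # option 3: leave the NAs to tools that tolerate them
> avgVal <- mean(sales$Val, na.rm=TRUE)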
We could impute unit-prices for the missing products, but with product "p2689", we would only have 60% of the data to fill in the other 40%. If removing them leaves a product with too few transactions, we could join those transactions with ones from similar products for the outlier detection tests. The best option is to remove those transactions. So "detach()" the SALES data and remove the transactions with both quantity and value missing, by sub-setting the SALES data.
Single Unknown Values
Now that we have resolved the unknown values in both the quantity and value variables, we refocus on transactions with one unknown in either variable. There were 888 transactions with both missing; for a single missing value, there are 13,248 transactions, nearly 15 times as many. Let us start with the quantity variable, "Quant".
We stratify the quantities by the product code, searching for the proportion of NAs in the quantities of each product. We flag a product if a high proportion of its transactions have missing quantities.
Missing Quantity for Products Code:
> # move on to unknown values in either quantity or value ####
> # how many transactions?
> length(which(is.na(sales$Quant) | is.na(sales$Val)))
[1] 13248
>
> # check missing quantity first
> nnasQp <- tapply(sales$Quant, list(sales$Prod),
+                  function(x) sum(is.na(x)))
> propNAsQp <- nnasQp/table(sales$Prod)
> propNAsQp[order(propNAsQp, decreasing=T)[1:10]]
    p2442     p2443     p1653     p4101     p4243      p903     p3678
1.0000000 1.0000000 0.9090909 0.8571429 0.6842105 0.6666667 0.6666667
    p3955     p4464     p1261
0.6428571 0.6363636 0.6333333
> # two products, p2442 and p2443, have unknown quantities in all
> # of their transactions
> sales[sales$Prod %in% c("p2442", "p2443"),]
           ID  Prod Quant    Val  Insp
21259   v2921 p2442    NA   5715  unkn
21260   v2922 p2442    NA  21250  unkn
21261   v2923 p2442    NA 102210  unkn
21262   v2922 p2442    NA 213155  unkn
21263   v2924 p2442    NA   4870  unkn
21264   v2925 p2443    NA  17155    ok
58422   v2924 p2442    NA   4870  unkn
58423   v2922 p2442    NA  21250  unkn
58424   v4356 p2442    NA  53815  unkn
58425   v2922 p2442    NA 213155  unkn
58426   v2925 p2443    NA   3610 fraud
58427   v4355 p2443    NA   5260  unkn
58428   v4355 p2443    NA   1280 fraud
102076  v2921 p2442    NA 137625  unkn
102077  v2920 p2442    NA  21310  unkn
102078  v4839 p2442    NA   5190  unkn
102079  v4356 p2442    NA  11320  unkn
102080  v2922 p2442    NA  34180  unkn
102081  v2925 p2443    NA   3610  unkn
102082  v4355 p2443    NA   5260  unkn
102083  v4355 p2443    NA   1280  unkn
102084  v2925 p2443    NA   3075  unkn
153543  v5077 p2442    NA   7720  unkn
153544  v2924 p2442    NA   9620  unkn
153545  v2920 p2442    NA  34365  unkn
153546  v2925 p2443    NA   3455  unkn
195784  v5077 p2442    NA   7720  unkn
195785  v4356 p2442    NA  43705  unkn
195786  v2939 p2443    NA   5465  unkn
195787  v2925 p2443    NA  14990  unkn
252153  v2924 p2442    NA   4870  unkn
252154  v2921 p2442    NA 137625  unkn
252155  v5077 p2442    NA   7720  unkn
252156  v2922 p2442    NA  66820  unkn
252157  v5077 p2442    NA  12035  unkn
252158  v2920 p2442    NA  79320  unkn
252159  v2925 p2443    NA   3610  unkn
325280  v2924 p2442    NA   4870  unkn
325281  v2921 p2442    NA 137625  unkn
325282  v5077 p2442    NA   7720  unkn
325283  v2922 p2442    NA  66820  unkn
325284  v5077 p2442    NA  12350  unkn
325285  v5077 p2442    NA  12035  unkn
325286  v2920 p2442    NA  43180  unkn
325289  v2925 p2443    NA   3610  unkn
325290  v4355 p2443    NA   5260  unkn
325291  v4355 p2443    NA   1280  unkn
325292  v2925 p2443    NA   2890  unkn
390840  v5077 p2442    NA  11515  unkn
390841  v4356 p2442    NA   4695  unkn
390842  v2923 p2442    NA  15580  unkn
390843  v2920 p2442    NA  27320  unkn
390844  v6044 p2442    NA  21215  unkn
390845  v4356 p2442    NA  53190  unkn
> # delete both products, because each has transactions already
> # labeled "ok" or "fraud"
> sales <- sales[!sales$Prod %in% c("p2442", "p2443"),]
> # update the factor levels
> nlevels(sales$Prod) # 4548
[1] 4548
> sales$Prod <- factor(sales$Prod)
> nlevels(sales$Prod) # 4546
[1] 4546
> # now the correct number, after removing the 2 products
Looking at the proportions, products "p2442" and "p2443" lack the quantity metric in all of their transactions! Also, "p1653" has about 91% missing, and "p4101" has about 86% missing. These are quite stark numbers. For the two products that lack all of their quantity values, looking at their fraud inspection status, we see some transactions labeled "ok" and some labeled "fraud". Because it is not statistically sound to rely on those evaluations given the lack of data, we remove both problem products. Additionally, remember to update the levels in the "Prod" variable, since we removed "p2442" and "p2443".
Missing Quantity by Salespeople Code:
> # check salespeople with transactions of unknown quantity
> nnasQs <- tapply(sales$Quant, list(sales$ID),
+                  function(x) sum(is.na(x)))
> propNAsQs <- nnasQs/table(sales$ID)
> propNAsQs[order(propNAsQs, decreasing=T)[1:10]]
    v2925     v5537     v5836     v6058     v6065     v4368     v2923
1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 0.8888889 0.8750000
    v2970     v4910     v4542
0.8571429 0.8333333 0.8095238
> # quite a few salespeople did not fill out the quantity info,
> # but we can use the unit-price from other salespeople's
> # transactions, so no need to delete
Here we again take the table of the number of missing quantities by salesperson and find the proportion by dividing by the table of salespeople. Taking the top 10 results, we see that salespeople "v2925", "v5537", "v5836", "v6058", and "v6065" have all of their transactions missing the quantity variable. However, as long as we have transactions for the same products by other salespeople, we can use the unit-price to calculate the quantity from the value element that is present.
Missing Value by Product Code:
> # check unknown values
> # by product
> nnasVp <- tapply(sales$Val, list(sales$Prod),
+                  function(x) sum(is.na(x)))
> propNAsVp <- nnasVp/table(sales$Prod)
> propNAsVp[order(propNAsVp, decreasing=T)[1:10]]
     p1110      p1022      p4491      p1462        p80      p4307
0.25000000 0.17647059 0.10000000 0.07500000 0.06250000 0.05882353
     p4471      p2821      p1017      p4287
0.05882353 0.05389222 0.05263158 0.05263158
> # reasonable results, no need to delete
Using a similar pattern to find the proportion of missing "Val" values for each product, we pleasantly discover no high percentages. Since no product has a high proportion of NAs, we do not need to delete anything.
Missing Value by Salespeople Code:
> # unknown values by salespeople
> nnasVs <- tapply(sales$Val, list(sales$ID),
+                  function(x) sum(is.na(x)))
> propNAsVs <- nnasVs/table(sales$ID)
> propNAsVs[order(propNAsVs, decreasing=T)[1:10]]
     v5647        v74      v5946      v5290      v4472      v4022
0.37500000 0.22222222 0.20000000 0.15384615 0.12500000 0.09756098
      v975      v2814      v2892      v3739
0.09574468 0.09090909 0.09090909 0.08333333
> # reasonable results again
Now examining the missing "Val" values by salespeople, we observe no salesperson with an overly high proportion of NAs. These are acceptable results again, with no need to delete any more transactions.
Imputing Missing Data
Now that we have removed the transactions with insufficient information, we can fill in the remaining values using our fill-in strategy of relying on the unit-price, while skipping those transactions previously audited and labeled as "fraud". We use the median unit-price of a product's transactions as the typical price for that product. (I saved the SALES data after we took out the invalid transactions; you might want to do the same.) To find the median unit-price without consulting the fraudulent transactions, we select transactions whose "Insp" (inspection) variable is not equal to fraud, '!= "fraud"'.
Filling In 'Quant' and 'Val' Code:
> load("sales.rdata") ####
> # imputing ####
>
> # calculate median price, while ignoring frauds
> tPrice <- tapply(sales[sales$Insp != "fraud", "Uprice"],
+                  list(sales[sales$Insp != "fraud", "Prod"]),
+                  median, na.rm=T)
> # now we can use median unit-price to calculate Quant and Val;
> # we already eliminated transactions with both missing
>
> # begin with Quant, find missing Quants
> noQuant <- which(is.na(sales$Quant))
> sum(is.na(sales$Quant))
[1] 12900
> # impute Quant, rounding up
> sales[noQuant, 'Quant'] <- ceiling(sales[noQuant, 'Val']/
+                              tPrice[sales[noQuant, "Prod"]])
>
> # next Val, find missing Vals
> noVal <- which(is.na(sales$Val))
> sum(is.na(sales$Val))
[1] 294
> # impute Vals
> sales[noVal, 'Val'] <- sales[noVal, 'Quant']*
+                        tPrice[sales[noVal, 'Prod']]
We fill in the missing 'Quant' values by creating an index of the missing entries, and discover 12,900 transactions ready to be imputed. Then we round all of the imputed 'Quant' values up with "ceiling()", since quantity is an integer. Remember, value/quantity = unit-price, so we divide the value by the unit-price to obtain the quantity.
Next we tackle the missing 'Val' values: we create the missing 'Val' index and find 294 transactions we can fill in. Again, using the simple formula, we multiply the quantity by the unit-price to obtain the value.
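As a quick sanity check (my own addition, not from the book), we can assert that the imputation left no NAs behind; this assumes every remaining product has at least one non-fraud transaction with a known unit-price, so no entry of "tPrice" is NA:

> # confirm that no unknown quantities or values remain;
> # stopifnot() stays silent if the checks pass
> stopifnot(!any(is.na(sales$Quant)),
+           !any(is.na(sales$Val)))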
Clean Up
Now that we have no more unknown values in 'Quant' or 'Val', we have a complete, clean dataset. But we are not finished! Since we have all the quantities and values, we can recalculate the unit-price with every value present. And make sure to save this SALES dataset, if you have not already. Naming the file 'salesClean.rdata' lets us differentiate the regular SALES set from the clean SALES set.
Recalculating Unit-Price Code:
> # all unit-prices present, so recalculate
> sales$Uprice <- sales$Val/sales$Quant
> # now we have a dataset free of unknowns
> # save our progress!
> save(sales, file='salesClean.rdata')
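To confirm the save worked, you can reload the file in a fresh session (a small check of my own, assuming 'salesClean.rdata' sits in your working directory):

> # reload the cleaned data; "sales" reappears in the workspace
> load("salesClean.rdata")
> # the recalculated unit-prices should be complete
> stopifnot(!any(is.na(sales$Uprice)))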
Summary
In this post, we removed the 888 transactions missing both "Quant" and "Val", dropped products "p2442" and "p2443", imputed the remaining single unknowns using median unit-prices, and saved the clean dataset. Look out for the next post, where we begin to identify those pesky outliers!
Thanks for reading,
Wayne
@beyondvalence
LinkedIn
Fraudulent Transactions Series:
1. Predicting Fraudulent Transactions in R: Part 1. Transactions
2. Predicting Fraudulent Transactions in R: Part 2. Handling Missing Data
3. Predicting Fraudulent Transactions in R: Part 3. Handling Transaction Outliers
4. Predicting Fraudulent Transactions in R: Part 4. Model Criterion, Precision & Recall
5. Predicting Fraudulent Transactions in R: Part 5. Normalized Distance to Typical Price