Pages

Sunday, July 27, 2014

Predicting Fraudulent Transactions in R: Part 2. Handling Missing Data


Hello Readers,

"We have missing data." How many times have you heard that sentence and cringed inside? If you worked with user-generated data before, you most likely have happened upon the "NULL" or "NA" or "99999" values, which possibly diminished your usable data. It gets me every time, too. After a while, you get used to those encounters, because as data scientists, we have the tools and methods to overcome incomplete data, and still carry out the analysis. In this post, I discuss missing data values in our Case Study for Fraudulent Transactions.


One Possible 'Option' When Encountering Missing Data

This is Part 2 of the Case Study, and here we handle the missing data. Let's begin where we left off in Part 1.


(This is a series from Luis Torgo's  Data Mining with R book.)


Dual Unknown Values


When we took the "summary()" of our SALES data, we found our "Quant" and "Val" variables had NA values. For each transaction, the quantity of the product sold and the value of the sale are important predictors (only, in this case) of being a fraudulent transaction or deemed "ok". So when 888 transactions are missing both variables, we find it difficult to impute any value due to the number of unknowns. If we had a "Quant" or "Val" variable available, we might have been able to use the unit-price of the product (value/quantity) to calculate the missing variable.

Salespeople and Product NAs Code:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
> attach(sales)
> 
> totS <- table(ID)
> totP <- table(Prod)
> 
> # transactions per values and product for NAs
> nas <- sales[which(is.na(Quant) & is.na(Val)), c("ID", "Prod")]
> 
> # obtain salesppl NAs
> propS <- 100* table(nas$ID)/totS
> propS[order(propS, decreasing=T)[1:10]]

    v1237     v4254     v4038     v5248     v3666     v4433     v4170 
13.793103  9.523810  8.333333  8.333333  6.666667  6.250000  5.555556 
    v4926     v4664     v4642 
 5.555556  5.494505  4.761905 
> #
> # the transactions represent a small proportion of transactions
> # looking at the products:
> propP <- 100* table(nas$Prod)/totP
> propP[order(propP, decreasing=T)[1:10]]

   p2689    p2675    p4061    p2780    p4351    p2686    p2707    p2690 
39.28571 35.41667 25.00000 22.72727 18.18182 16.66667 14.28571 14.08451 
   p2691    p2670 
12.90323 12.76596 
> 
> # several products more than 20% of their transactions would be
> # removed: that is too much, p2689 would have 40% removed
> detach(sales)
> sales <- sales[-which(is.na(sales$Quant) & is.na(sales$Val)),]

Examining the salespeople first, we create a table for the number of transactions by each salesperson. Then we subset the SALES data, pulling the transactions with NAs in "Quant" and "Val", with the information on the "ID" and "Prod" (salesperson ID and product ID). Dividing this table of NA present transactions by the full table of all the transactions, measures the proportion of NAs in each salesperson's transactions. Taking the top 10 salespeople who NAs in their transactions with "order()", we discover that salesperson "v1237" had the highest percentage of transactions with dual NAs, at 13.8%. That percentage is not too high, and all the other salespeople have lower percentages. We can breathe easy, since not one single salesperson had the majority of his or her transaction reports filled with NAs. So if we remove transactions with dual NAs, not one single salesperson will be overly affected.


Investigating the products next, we do the same procedure and create tables for all of the products, for products with NAs, and the percentage of products with NAs with division. Looking at the 10 products with missing values, product "p2689" has nearly 40 of its transactions incomplete. Unlike the NAs grouped by salespeople, if we delete the dual NA transactions, product "p2689", and "p2675" will have over 35% of their transactions removed, and 4 products would have at least 20% removed! Clearly some products have more missing values than others.



Alternatives


There are generally 3 alternatives available to us as options when we encounter missing data. The first is to remove those rows. Second, fill in the missing values using a calculated method, or third, use tools to handle those values. If you work with an industry specific program or have specific handling instructions, then option three would be the best decision. But here, we can only choose from the first two options.

We could impute unit-prices for the missing products, but with product "p2689", we would only have 60% of the data to fill in the 40%. If there are too many transactions removed, we would then join those transactions with ones from similar products for outlier detection tests. The most best option would be to remove those transactions. So "detach()" the SALES data and remove those transactions with both missing quantity and value elements, via sub-setting the SALES data.


Single Unknown Values


Now that we have resolved the unknown values in both quantity and value variables, we refocus onto transactions with one unknown in either variable. There were 888 transactions with both missing, and for the single missing value, there are 13,248 transactions, nearly 15 times as many. Let us first start with the quantity variable, "Quant".

We stratify the quantities by the product code, thus searching for the proportion of NAs present in the quantities of a certain product. We flag a product if a high number of its transactions have missing quantities.

Missing Quantity for Products Code:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
> # move to unknown value either quantity or value ####
> # how many transactions?
> #
> length(which(is.na(sales$Quant) | is.na(sales$Val)))
[1] 13248
> #
> # check missing quantity first
> nnasQp <- tapply(sales$Quant, list(sales$Prod),
+                  function(x) sum(is.na(x)))
> propNAsQp <- nnasQp/table(sales$Prod)
> propNAsQp[order(propNAsQp, decreasing=T)[1:10]]
    p2442     p2443     p1653     p4101     p4243      p903     p3678 
1.0000000 1.0000000 0.9090909 0.8571429 0.6842105 0.6666667 0.6666667 
    p3955     p4464     p1261 
0.6428571 0.6363636 0.6333333 
> # there are two products with all transactions of unknown
> # quantity, p2442, p2443
> sales[sales$Prod %in% c("p2442", "p2443"),]
          ID  Prod Quant    Val  Insp
21259  v2921 p2442    NA   5715  unkn
21260  v2922 p2442    NA  21250  unkn
21261  v2923 p2442    NA 102210  unkn
21262  v2922 p2442    NA 213155  unkn
21263  v2924 p2442    NA   4870  unkn
21264  v2925 p2443    NA  17155    ok
58422  v2924 p2442    NA   4870  unkn
58423  v2922 p2442    NA  21250  unkn
58424  v4356 p2442    NA  53815  unkn
58425  v2922 p2442    NA 213155  unkn
58426  v2925 p2443    NA   3610 fraud
58427  v4355 p2443    NA   5260  unkn
58428  v4355 p2443    NA   1280 fraud
102076 v2921 p2442    NA 137625  unkn
102077 v2920 p2442    NA  21310  unkn
102078 v4839 p2442    NA   5190  unkn
102079 v4356 p2442    NA  11320  unkn
102080 v2922 p2442    NA  34180  unkn
102081 v2925 p2443    NA   3610  unkn
102082 v4355 p2443    NA   5260  unkn
102083 v4355 p2443    NA   1280  unkn
102084 v2925 p2443    NA   3075  unkn
153543 v5077 p2442    NA   7720  unkn
153544 v2924 p2442    NA   9620  unkn
153545 v2920 p2442    NA  34365  unkn
153546 v2925 p2443    NA   3455  unkn
195784 v5077 p2442    NA   7720  unkn
195785 v4356 p2442    NA  43705  unkn
195786 v2939 p2443    NA   5465  unkn
195787 v2925 p2443    NA  14990  unkn
252153 v2924 p2442    NA   4870  unkn
252154 v2921 p2442    NA 137625  unkn
252155 v5077 p2442    NA   7720  unkn
252156 v2922 p2442    NA  66820  unkn
252157 v5077 p2442    NA  12035  unkn
252158 v2920 p2442    NA  79320  unkn
252159 v2925 p2443    NA   3610  unkn
325280 v2924 p2442    NA   4870  unkn
325281 v2921 p2442    NA 137625  unkn
325282 v5077 p2442    NA   7720  unkn
325283 v2922 p2442    NA  66820  unkn
325284 v5077 p2442    NA  12350  unkn
325285 v5077 p2442    NA  12035  unkn
325286 v2920 p2442    NA  43180  unkn
325289 v2925 p2443    NA   3610  unkn
325290 v4355 p2443    NA   5260  unkn
325291 v4355 p2443    NA   1280  unkn
325292 v2925 p2443    NA   2890  unkn
390840 v5077 p2442    NA  11515  unkn
390841 v4356 p2442    NA   4695  unkn
390842 v2923 p2442    NA  15580  unkn
390843 v2920 p2442    NA  27320  unkn
390844 v6044 p2442    NA  21215  unkn
390845 v4356 p2442    NA  53190  unkn
> # delete them because both have OK and Fraud
> sales <- sales[!sales$Prod %in% c("p2442", "p2443"),]
> # update levels
> #
> nlevels(sales$Prod) # 4548
[1] 4548
> sales$Prod <- factor(sales$Prod)
> nlevels(sales$Prod) # 4846
[1] 4546
> # now has correct number, after we removed the 2 products

Looking at the proportions, product "p2442" and "p2443" lack the quantity metric all of their transactions! Also, "p1653" has 90% missing, and "p4101" has 86% missing. These are quite stark numbers. For those products that lack all of their quantity values, looking at their fraud inspection status, we see some labeled "ok" and some labeled "fraud". Because it is not statistically sound to use those evaluations given the lack of data, we will remove those two problem products. Additionally, remember to update the levels in the "Prod" variable, since we removed "p2442" and "p2443".


Missing Quantity by Salespeople Code:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
> # check salesppl with transactions of unknown quantity
> nnasQs <- tapply(sales$Quant, list(sales$ID),
+                  function(x) sum(is.na(x)))
> propNAsQs <- nnasQs/table(sales$ID)
> propNAsQs[order(propNAsQs, decreasing=T)[1:10]]
    v2925     v5537     v5836     v6058     v6065     v4368     v2923 
1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 0.8888889 0.8750000 
    v2970     v4910     v4542 
0.8571429 0.8333333 0.8095238 
> # quite a few salesppl did not fill out the quantity info
> # but we can use numbers from other sales ppl under unitprice
> # so no need to delete

Here we again take the table of number of missing quantities by salespeople and find the proportion by dividing with the table of salespeople. Then taking the top 10 results, we see that salesperson "v2925", "v5537", "v5836", "v6058", and "v6065", have all of their transactions missing the quantity variable. However, as long as we have transactions of the same product by other people, we can use the unit-price to calculate the quantity from the present value element.


Missing Value by Product Code:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
> # check unknown values
> # by product
> nnasVp <- tapply(sales$Val, list(sales$Prod),
+                  function(x) sum(is.na(x)))
> propNAsVp <- nnasVp/table(sales$Prod)
> propNAsVp[order(propNAsVp, decreasing=T)[1:10]]
     p1110      p1022      p4491      p1462        p80      p4307 
0.25000000 0.17647059 0.10000000 0.07500000 0.06250000 0.05882353 
     p4471      p2821      p1017      p4287 
0.05882353 0.05389222 0.05263158 0.05263158 
> # reasonable results, no need to delete

Using the similar pattern of finding the proportion of missing "Val" values for each product, we pleasantly discover no high percentages. Since no product has high NAs proportions, we do not need to delete them.


Missing Value by Salespeople Code:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
> # unknown values by salespeople
> #
> nnasVs <- tapply(sales$Val, list(sales$ID),
+ function(x) sum(is.na(x)))
> propNAsVs <- nnasVs/table(sales$ID)
> propNAsVs[order(propNAsVs, decreasing=T)[1:10]]
     v5647        v74      v5946      v5290      v4472      v4022 
0.37500000 0.22222222 0.20000000 0.15384615 0.12500000 0.09756098 
      v975      v2814      v2892      v3739 
0.09574468 0.09090909 0.09090909 0.08333333 
> # reasonable results again

Now examining the missing values in "Val" by salespeople, we observe no salesperson with overly high proportions of NAs involving "Val". We have acceptable results, with no need to delete any more transactions.



Imputing Missing Data


Now that we have removed the transactions with insufficient information, we can fill in the remaining values using our fill-in strategy of relying on the unit-price. Also, we need to skip those transactions previously audited and labeled as "fraud". We utilize the median unit price of transactions as the typical price for their respective products. (I saved the SALES data because we took out invalid transactions, you might want to do the same.) To find the median unit-price without consulting those fraudulent transactions, we specify those transactions with the "Insp" (inspection) variable as not equal to fraud, '!= "fraud" '.

Filling In 'Quant' and 'Val' Code:
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
> load("sales.rdata") ####
> # imputing ####
> #
> # calculate median price, while ignoring frauds
> #
> tPrice <- tapply(sales[sales$Insp != "fraud", "Uprice"],
+                  list(sales[sales$Insp != "fraud", "Prod"]),
+                  median, na.rm=T)
> # now we can use median unit-price to calculate Quant and Val
> #
> # we already eliminated transactions with both missing
> #
> # begin with Quant, find missing Quants
> noQuant <- which(is.na(sales$Quant))
> sum(is.na(sales$Quant))
[1] 12900
> # Imputing Quant
> # round Quant up
> sales[noQuant,'Quant'] <- ceiling(sales[noQuant, 'Val']/
+                           tPrice[sales[noQuant,"Prod"]])
> #
> # next Val, find missing Vals
> noVal <- which(is.na(sales$Val))
> sum(is.na(sales$Val))
[1] 294
> # impute Vals
> sales[noVal, 'Val'] <- sales[noVal,'Quant']*
+                        tPrice[sales[noVal, 'Prod']]

We fill in the missing 'Quant' values by creating a missing index, and discover 12,900 transactions ready to be imputed. Then we round all the 'Quant' values we will impute up, since quantity is an integer. Remember, value/qantity = unit-price, so we divide value by the unit-price to obtain the quantity values.


Next we tackle the missing 'Val' values, create the missing 'Val' index, and we find 294 we can fill in. Again, using the simple formula, we multiply the quantity by the unit-price to obtain the value.



Clean Up


Now that we have no more unknown values in 'Quant' or 'Val', we have a complete, or clean dataset. But we are not finished! Since we have all the quantity and values, we can recalculate the unit-price with all the values present. And make sure to save this SALES dataset, in case you have not yet already. Naming the file 'salesClean.rdata' allows us to differentiate the regular SALES set with the clean SALES set.

Recalculating Unit-Price Code:
1
2
3
4
5
6
> # all unit-prices present, so recalculate
> #
> sales$Uprice <- sales$Val/sales$Quant
> # now we have a dataset free of unknowns
> # save our progress!
> save(sales, file='salesClean.rdata')



Summary



Whew! Handling missing data is a necessary task, and when to remove rows entries requires data wrangling experience and reliance/knowledge of the other data present. OK folks, so here we located and quantified the unknown values in the variables, identified which rows (transactions) we needed and which to remove, and removed those while imputing values from the same products using the median unit-price. Sometimes in user-generated data, a variable might not be composed of other variables, so imputation is impossible. Other times, we can infer values based on other variables. Either way, working with 'clean' data is not a privilege, since we usually have to work to refine it!

Thursday, July 24, 2014

Predicting Fraudulent Transactions in R: Part 1. Transactions


Hello Readers,

Today we start a new Case Study Series to audit fraudulent sales transactions. What makes a transaction fraudulent, you might ask? Looking from a statistical standpoint, transactions, or observations would be considered fraudulent if they were unusual in nature, such that they deviate from the norm or arouse suspicion. Various financial instruments such as bank accounts or credit cards can be analyzed for fraudulent transactions.


Fishy Numbers
Our transaction dataset contains information concerning the sales and products from salespersonnel from a certain company. The identities have been wiped from the data to preserve anonymity. The salespeople set their own selling prices and report the sales back to the company at the end of each month. A minor percentage of these records were audited and labeled as "ok" or "fraud". The other records depend on us, the data scientists, for fraud discovery. This situation arises frequently in various real companies, where only a few records can be inspected individually among the many thousands or even millions of records, because of budget and time constraints. So we will build a model from the set of transactions which have been audited, and predict the audited result for the remaining transactions. That way we can automate the auditing and catch more fraudulent transactions, thus saving the company's costs.

The data is available from the "DMwR" package in R. Load it, prepare yourself for fishy numbers, and let's get started.

(This is a series from Luis Torgo's  Data Mining with R book.)


Numerous Transactions


You might have guessed that there are many records in this data set. And you would be correct, and thankfully not millions. Start by loading the library "DMwR" and the subsequent dataset "sales". Then we call the basic "head()" and "summary()" functions to understand the data content.

Data Loading and Preview Code:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
> # loading the library
> library(DMwR)
Loading required package: lattice
Loading required package: grid
KernSmooth 2.23 loaded
Copyright M. P. Wand 1997-2009
> 
> # loading the sales dataset
> data(sales)
> 
> # initial exploration
> head(sales)
  ID Prod Quant   Val Insp
1 v1   p1   182  1665 unkn
2 v2   p1  3072  8780 unkn
3 v3   p1 20393 76990 unkn
4 v4   p1   112  1100 unkn
5 v3   p1  6164 20260 unkn
6 v5   p2   104  1155 unkn
> 
> dim(sales)
[1] 401146      5
> 
> summary(sales)
       ID              Prod            Quant                Val         
 v431   : 10159   p1125  :  3923   Min.   :      100   Min.   :   1005  
 v54    :  6017   p3774  :  1824   1st Qu.:      107   1st Qu.:   1345  
 v426   :  3902   p1437  :  1720   Median :      168   Median :   2675  
 v1679  :  3016   p1917  :  1702   Mean   :     8442   Mean   :  14617  
 v1085  :  3001   p4089  :  1598   3rd Qu.:      738   3rd Qu.:   8680  
 v1183  :  2642   p2742  :  1519   Max.   :473883883   Max.   :4642955  
 (Other):372409   (Other):388860   NA's   :13842       NA's   :1182     
    Insp       
 ok   : 14462  
 unkn :385414  
 fraud:  1270  
> 
> # to view levels of sales people and products
> nlevels(sales$ID)
[1] 6016
> nlevels(sales$Prod)
[1] 4548
> 

We see from the output we have 401,146 transactions, with 5 variable columns: "ID", "Prod", "Quant", "Val", and "Insp". The salesperson ID is numbered with a prefix "v" while the product code is prefixed with a "p". Because a salesperson can have multiple transactions with several different products, and products can be sold by different salespeople, we require unique elements from "ID" and "Prod" to gauge the number of salespeople and products available. Using the "nlevels()" function we discover 6,016 salespeople and 4,548 products in the dataset.


From the summary output, we see some missing values in the "Quant" and "Val" columns, in addition to some ridiculous quantity and value figures. We will have to address the sale of 473,883,883 units of product, and the sale of $4,642,955 worth of product. Perhaps a costly typo might explain those large figures, and we deal will with them in the next post, where we deal with missing values. 

If any quantity or value metric is missing in a transaction, it would be difficult to verify that the particular transaction is not fraudulent: we would not be able to tell how many and/or how much a product sold for. A single product could be sold at too low of a price, or many products could be sold for next to nothing, resulting in (fraudulent) losses for the company. So we need to investigate those missing values further.

Tracking Missing Entries Code:
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
> # large number of missing values for quant and value
> length(which(is.na(sales$Quant) & is.na(sales$Val)))
[1] 888
> # or sum(is.na(sales$Quant) & is.na(sales$Val))
> # proportion of sales
> table(sales$Insp)/nrow(sales) * 100

       ok      unkn     fraud 
 3.605171 96.078236  0.316593 
> 
> table(sales$Insp)

    ok   unkn  fraud 
 14462 385414   1270 
> 

Well 888 entries have both the quantity AND values missing! What were those salespeople doing when they filed their reports? Because we cannot confidently impute average product values for those transactions, we hope that they are not designated "ok". In a later post, we will investigate and deal with those transactions with the missing elements, either in "Quant", "Val", or in both.



Tracking Transactions


To achieve further understanding of the sales data, we look at the distribution of transactions per sales person and per product. By doing this graphically, we can see the salespeople and product activity from the available transactions.

Plotting Transactions Code:
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
> ## plotting transactions 
> 
> # reports per sales person
> totS <-table(sales$ID)
>  
> barplot(totS, main="Transactions per salespeople", names.arg="",
+ xlab="Salespeople", ylab="Amount")
> 
> # reports per product
> totP <- table(sales$Prod)
> barplot(totP, main="Transactions per product", names.arg="",
+ xlab="Products", ylab="Amount", ylim=c(0,4000))
> #
> # plots show much variability in transactions
> 

We begin with the salespeople. From the below plot, we see clear evidence of outstanding performers, with some reaching over 10,000 and 6,000 transactions! Suspicious? Maybe. Now from the graph, we understand that most of the salespeople numbered below 1,000 transactions, so we will keep an eye on the prolific sellers of products. Keep note however, that this is just the number of transactions, not the quantity or value of the product sold. Just because a salesperson has more transactions, does not mean he or she brought in more sales revenue.


Figure 1. Transactions per Salespeople
Now we turn to the products. Some products evidently were involved in more transactions that others, with the highest being sold in roughly 4,000 transactions. Another similar caveat- even though that particular product A was sold over 4,000 times, another product B could have been sold in only 400 transactions, but with quantities averaging 1,000 per order. So the product A in the most transactions might not have been the actual best seller, since the plot below does not depict the quantity per transaction.


Figure 2. Transactions per Product


Generating Unit-Price


Because salespeople set the price, and we saw the suspicious maximum values for quantity and value from the "summary()" output: 473,883,883 units of product, and a product that was sold for a total of $4,642,955, we must use another metric compare the product prices. Thus we use the unit-price, where we take the "Val" and divide it by "Quant" to get the price paid per unit for that particular product in that transaction:

Unit-Price, Product Prices, & Salespeople Sales Code:
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
> # create unit price to see difference in prices
> #
> sales$Uprice <- sales$Val/sales$Quant
> head(sales)
  ID Prod Quant   Val Insp    Uprice
1 v1   p1   182  1665 unkn  9.148352
2 v2   p1  3072  8780 unkn  2.858073
3 v3   p1 20393 76990 unkn  3.775315
4 v4   p1   112  1100 unkn  9.821429
5 v3   p1  6164 20260 unkn  3.286827
6 v5   p2   104  1155 unkn 11.105769
> # top expensive and cheap products
> #
> attach(sales)
> # find median price of each product
> upp <- aggregate(Uprice, list(Prod), median, na.rm=TRUE)
> # alternate T,F for ordering up or down
> # get top 5 for each order
> topP <- sapply(c(T,F), function(o)
+                upp[order(upp[,2],decreasing=o)[1:5],1])
> colnames(topP) <- c('Expensive', 'Cheap')
> # most expensive and cheapest products by median unit price
> topP
     Expensive Cheap  
[1,] "p3689"   "p560" 
[2,] "p2453"   "p559" 
[3,] "p2452"   "p4195"
[4,] "p2456"   "p601" 
[5,] "p2459"   "p563" 
> #
> # confirm expensive and cheap products with boxplots
> # use %in% belong in a set of first row in topP
> #
> tops <- sales[Prod %in% topP[1,], c("Prod", "Uprice")]
> # change to 2 levels, from
> tops$Prod <- factor(tops$Prod)
> # uses log scale
> boxplot(Uprice ~ Prod, data=tops, ylab="Uprice", log="y")
> # see big discrepancy in highest and lowest products
> #
> #
> # which salespeople bring in most and least?
> # sum by unique salespeople
> vsales <- aggregate(Val, list(ID), sum, na.rm=TRUE)
> # take top 5 and lowest 5
> scoresSs <- sapply(c(T,F), function(o)
+                    vsales[order(vsales$x, decreasing=o)[1:5],1])
> colnames(scoresSs) <- c("Most", "Least")
> # top and lowest 5 salespeople by value
> scoresSs
     Most    Least  
[1,] "v431"  "v3355"
[2,] "v54"   "v6069"
[3,] "v19"   "v5876"
[4,] "v4520" "v6058"
[5,] "v955"  "v4515"
> 
> # boxplot of best and worst salespeople
> valsales <- sales[ID %in% scoresSs[1,], c("ID", "Val")]
> valsales$ID <- factor(valsales$ID)
> boxplot(Val ~ ID, data=valsales, ylab="Transaction Value", log="y")

By calling "head()" of the "sales" data, we see the first six transactions, and the "Uprice" variable was added to the end of the dataset. Next we "attach()" the dataset so that R adds the dataset to the search path, so we can call the column objects directly.



Unit-Price of Products


We tackle the average unit-price of the various products first. Using the "aggregate()" function, we can apply the "median()" function to find the median unit-price by each product level. Remember to remove the missing values by setting "na.rm=TRUE". Next we use the "sapply()" function to pass TRUE and FALSE arguments to the "order()" function acting on the median product unit-price output. The function takes the first 5 in decreasing order, highest prices, and first 5 in increasing order, so lowest prices. Then we can see the product IDs, the most expensive product per unit-price was "p3689", and the least expensive was "p560".

Next we create a boxplot, shown below, of the those two extreme products, utilizing all of their unit-price transactions. "p3689" eclipses "p560", and by a large margin since the scale is in log.


Figure 3. Cheapest and Most Expensive Product by Unit-Price


Salespeople's Sale Values



Subsequently, in the R code, we turn to the value of sales by salespeople. Take "Val" and sum by "ID" with the "aggregate()" function, while removing the missing values. Similar to the previous unit-price calculation, we use "sapply()" again to obtain the 5 highest and lowest sales to determine the best and worst performing salespeople. Looking at "scoresSs", "v431" performed the best, and "v3355" performed the worst. We see this relationship in the boxplot below:


Figure 4. Salesperson with Highest and Lowest Sales

There is a clear portion of the high value salesperson's transactions that are high value outliers. The majority of "v431's" values were so high compared to "v3355", that "v3355's" transactions appear as a meager line on the log scale.


Sale Values Code:
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
> # diving into sales
> # the top 100 salespeople account for 40% of income
> #
> sum(vsales[order(vsales$x, decreasing=T)[1:100], 2])/sum(Val, na.rm=T)*100
[1] 38.33277
> #
> # while bottom 2k brought in less than 2% of income
> #
> sum(vsales[order(vsales$x, decreasing=F)[1:2000], 2])/sum(Val, na.rm=T)*100
[1] 1.988716

Investigating the value of sales further, we discover from the above R code that the top 100 salespeople accounted for nearly 40% of the total sales value. There is a high discrepancy between the high and low sales performers, as the lowest 2,000 summed to barely 2% of the total sales value.



Product Quantity



Now let's turn to the quantity of product per transaction. Using a similar code format from above, the "aggregate()" function returns the quantity of products grouped by product. Then "sapply()" comes into play again, giving us the highest and lowest 5 quantities for each product. In this dataset, salespeople sold product "p2516" the most, and "p2442" the least.

Quantity of Product Code:
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
> # for quantity sold for each product, more unbalanced
> #
> qs <- aggregate(Quant, list(Prod), sum, na.rm=T)
> scoresPs <- sapply(c(T,F), function(o)
+                    qs[order(qs$x, decreasing=o)[1:5], 1])
> colnames(scoresPs) <- c('most', 'least')
> scoresPs
     most    least  
[1,] "p2516" "p2442"
[2,] "p3599" "p2443"
[3,] "p314"  "p1653"
[4,] "p569"  "p4101"
[5,] "p319"  "p3678"
> 
> # create boxplot of low-high quantity products
> qtop <- sales[Prod %in% scoresPs[1,], c("Prod", "Quant")]
> qtop$Prod <- factor(qtop$Prod)
> boxplot(Quant ~ Prod, data=qtop, ylab=c("Transaction Quantity"),
+         log="y")
> 

For product "p2442", we do not even see the boxplot compared to the boxplot range of "p2516", despite being in log. This clearly shows the disparity between the quantities of product sold. Just look at the y-axis range: 1e+02 to 1e+08!



Figure 5. Lowest and Highest Quantity of Product

When we took the top 100 salespeople, they accounted for nearly 40% of the value generated. But for the quantity of product sold, again, the disparity is stark. For the 100 products sold, they consist of 75% of the total quantity. And for the lowest 4,000 products, they barely share 9% of the quantity.


More Quantity Code:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
> # of 4548 products
> # top 100 products about 75% of quantity
> sum(as.double(qs[order(qs$x, decreasing=T)[1:100], 2]))/
+ sum(as.double(Quant), na.rm=T) * 100
[1] 74.63478
> #
> # lowest 4k products less than 9% of quantity
> #
> sum(as.double(qs[order(qs$x, decreasing=F)[1:4000], 2]))/
+ sum(as.double(Quant), na.rm=T) * 100
[1] 8.944681
> # cannot draw conclusions about profit margins so lower
> # quantity might not be a bad thing

Even if the quantity is low, the salesperson could have sold it at a high value for a great profit margin. So low quantity does not necessarily mean low value, while we look at one variable at a time.



Outliers


When we look up at the stars, we marvel how far away they are. Similar values in this dataset appear to be as far away from the bulk of the data, as the stars are to us. Looking at boxplot outliers is a good starting point. Specifically, looking at values outside the whiskers, which are 1.5+/- the inner quartile range (IQR) will give us a measure of those outliers.


Unit-Price Outliers Code:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
> # the unit price should follow a near-normal distribution
> #
> # number of outliers from boxplot whiskers
> out <- tapply(Uprice, list(Prod=Prod),
+               function(x) length(boxplot.stats(x)$out))
> #
> # top 10 products which have outlying values
> #
> out[order(out, decreasing=T)[1:10]]
Prod
p1125 p1437 p2273 p1917 p1918 p4089  p538 p3774 p2742 p3338 
  376   181   165   156   156   137   129   125   120   117 
> # percentage of total outlying values
> sum(out)/nrow(sales)*100
[1] 7.34047
> # technically we should only inspect the OK data, but it is only
> # 3.6% of the total data, and we do not know the true values of
> # the fradulent data: risk will always exist
> 

Using "tapply()" to perform a function by groups, we take the number of outliers from each unique product. Looking at the top 10 products with outliers, "p1125" has 376, "p1437" has 181, as so on. When we divide total outliers by the number of transactions, we find that the outliers only consist of 7.3% of the data.


Now we perform the same analysis for the quantity of product:


Quantity Outliers Code:
1
2
3
4
5
6
7
8
9
> # outliers for quantity of product
> q.out <- tapply(Quant, list(Prod=Prod),
+ function(x) length(boxplot.stats(x)$out))
> 
> # top 10 quantity outiers
> q.out[order(q.out, decreasing=T)[1:10]]
Prod
p1125 p3774 p2742 p4089 p1917 p1437 p2273 p1938 p1918 p2957 
  541   244   234   217   216   211   199   197   175   163 

We see the product with the most quantity outliers is "p1125", the same culprit as the highest unit-price outlier. Keeping product "p1125" in mind for further cleaning and analysis, we now proceed to handling the missing data.



Summary


OK folks, I know, that's quite a lot of information to process at once, so I will stop here. In Part 1, we examined our SALES dataset, teasing apart the distributions of the product transactions, values and quantities. Doing so, we discovered that the value and quantities are heavily skewed, along with the number of transactions per salespeople. Some products and salespeople bring in the majority of the value and transactions. However the data is not ready for model creation, as there are certain values which are missing. Transactions with one missing value we can impute, but some have both the "Val" and "Quant" missing. So Part 2 will deal with what we data scientist 'love' the most: data cleaning,  and missing data. So stay tuned!


Thanks for reading,

Wayne
@beyondvalence
LinkedIn


Fraudulent Transactions Series:
1. Predicting Fraudulent Transactions in R: Part 1. Transactions
2. Predicting Fraudulent Transactions in R: Part 2. Handling Missing Data
3. Predicting Fraudulent Transactions in R: Part 3. Handling Transaction Outliers
4. Predicting Fraudulent Transactions in R: Part 4. Model Criterion, Precision & Recall
5. Predicting Fraudulent Transactions in R: Part 5. Normalized Distance to Typical Price
.