Hello Readers,
Today we start a new Case Study Series auditing fraudulent sales transactions. What makes a transaction fraudulent, you might ask? From a statistical standpoint, a transaction, or observation, is considered fraudulent when it is unusual in nature, deviating from the norm or arousing suspicion. Various financial instruments, such as bank accounts or credit cards, can be analyzed for fraudulent transactions.
Fishy Numbers
The data is available from the "DMwR" package in R. Load it, prepare yourself for fishy numbers, and let's get started.
(This is a series from Luis Torgo's Data Mining with R book.)
Numerous Transactions
You might have guessed that there are many records in this data set, and you would be correct; thankfully, not millions of them. Start by loading the "DMwR" library and its "sales" dataset. Then we call the basic "head()" and "summary()" functions to understand the data content.
Data Loading and Preview Code:
> # loading the library
> library(DMwR)
Loading required package: lattice
Loading required package: grid
KernSmooth 2.23 loaded
Copyright M. P. Wand 1997-2009
>
> # loading the sales dataset
> data(sales)
>
> # initial exploration
> head(sales)
  ID Prod Quant   Val Insp
1 v1   p1   182  1665 unkn
2 v2   p1  3072  8780 unkn
3 v3   p1 20393 76990 unkn
4 v4   p1   112  1100 unkn
5 v3   p1  6164 20260 unkn
6 v5   p2   104  1155 unkn
>
> dim(sales)
[1] 401146      5
>
> summary(sales)
       ID             Prod            Quant                Val
 v431   : 10159   p1125  :  3923   Min.   :      100   Min.   :   1005
 v54    :  6017   p3774  :  1824   1st Qu.:      107   1st Qu.:   1345
 v426   :  3902   p1437  :  1720   Median :      168   Median :   2675
 v1679  :  3016   p1917  :  1702   Mean   :     8442   Mean   :  14617
 v1085  :  3001   p4089  :  1598   3rd Qu.:      738   3rd Qu.:   8680
 v1183  :  2642   p2742  :  1519   Max.   :473883883   Max.   :4642955
 (Other):372409   (Other):388860   NA's   :13842       NA's   :1182
    Insp
 ok   : 14462
 unkn :385414
 fraud:  1270
>
> # to view levels of salespeople and products
> nlevels(sales$ID)
[1] 6016
> nlevels(sales$Prod)
[1] 4548
We see from the output that we have 401,146 transactions across 5 variable columns: "ID", "Prod", "Quant", "Val", and "Insp". Salesperson IDs carry a "v" prefix, while product codes carry a "p" prefix. Because a salesperson can have multiple transactions across several different products, and a product can be sold by different salespeople, we need the unique elements of "ID" and "Prod" to gauge how many salespeople and products there are. Using the "nlevels()" function, we discover 6,016 salespeople and 4,548 products in the dataset.
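One small caveat, as a sketch of my own: "nlevels()" reports the declared factor levels, which can include levels with no rows, so comparing against "unique()" confirms that the counts reflect salespeople and products that actually appear in the data.

> # my own sanity check: distinct values actually present in the data
> length(unique(as.character(sales$ID)))    # 6016 if no empty ID levels
> length(unique(as.character(sales$Prod)))  # 4548 if no empty Prod levels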
From the summary output, we see some missing values in the "Quant" and "Val" columns, in addition to some ridiculous quantity and value figures. We will have to address the sale of 473,883,883 units of product, and the sale of $4,642,955 worth of product. Perhaps a costly typo explains those large figures; we will deal with them in a later post, alongside the missing values.
If any quantity or value metric is missing in a transaction, it is difficult to verify that the transaction is not fraudulent: we cannot tell how many units sold, or for how much. A single product could be sold at too low a price, or many products could be sold for next to nothing, resulting in (fraudulent) losses for the company. So we need to investigate those missing values further.
Tracking Missing Entries Code:
> # large number of missing values for quant and value
> length(which(is.na(sales$Quant) & is.na(sales$Val)))
[1] 888
> # or sum(is.na(sales$Quant) & is.na(sales$Val))
> # proportion of sales
> table(sales$Insp)/nrow(sales) * 100

       ok      unkn     fraud
 3.605171 96.078236  0.316593
>
> table(sales$Insp)

    ok   unkn  fraud
 14462 385414   1270
Well, 888 entries have both the quantity AND the value missing! What were those salespeople doing when they filed their reports? Because we cannot confidently impute average product values for those transactions, we hope they are not designated "ok". In a later post, we will investigate and deal with the transactions missing either "Quant", "Val", or both.
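As a quick sanity check (my own sketch, not in the original post), we could cross-tabulate the inspection labels of those 888 doubly-missing transactions to see whether any were already marked "ok" or "fraud":

> # my own check: which Insp labels do the 888 doubly-missing rows carry?
> table(sales$Insp[is.na(sales$Quant) & is.na(sales$Val)])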
Tracking Transactions
To understand the sales data further, we look at the distribution of transactions per salesperson and per product. Doing this graphically lets us see salesperson and product activity across the available transactions.
Plotting Transactions Code:
> ## plotting transactions
>
> # reports per salesperson
> totS <- table(sales$ID)
>
> barplot(totS, main="Transactions per salespeople", names.arg="",
+         xlab="Salespeople", ylab="Amount")
>
> # reports per product
> totP <- table(sales$Prod)
> barplot(totP, main="Transactions per product", names.arg="",
+         xlab="Products", ylab="Amount", ylim=c(0,4000))
>
> # plots show much variability in transactions
We begin with the salespeople. From the plot below, we see clear evidence of outstanding performers, with the top two logging over 10,000 and 6,000 transactions! Suspicious? Maybe. From the graph, we also see that most salespeople recorded fewer than 1,000 transactions, so we will keep an eye on the prolific sellers. Note, however, that this is just the number of transactions, not the quantity or value of the product sold. Just because a salesperson has more transactions does not mean he or she brought in more sales revenue.
Figure 1. Transactions per Salespeople
Figure 2. Transactions per Product
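Before moving on, a quick numeric cross-check (a sketch of my own, reusing the "totS" table built above): sorting the transaction counts reproduces the top IDs from the earlier "summary()" output.

> # sketch: the most active salespeople, matching the ID column of summary(sales)
> head(sort(totS, decreasing=TRUE), 5)

 v431   v54  v426 v1679 v1085
10159  6017  3902  3016  3001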
Generating Unit-Price
Because salespeople set the price, and because we saw suspicious maximum values for quantity and value in the "summary()" output (473,883,883 units of one product, and a single transaction worth $4,642,955), we need another metric to compare product prices. Thus we use the unit-price, dividing "Val" by "Quant" to get the price paid per unit of that product in that transaction:
Unit-Price, Product Prices, & Salespeople Sales Code:
> # create unit price to see difference in prices
>
> sales$Uprice <- sales$Val/sales$Quant
> head(sales)
  ID Prod Quant   Val Insp    Uprice
1 v1   p1   182  1665 unkn  9.148352
2 v2   p1  3072  8780 unkn  2.858073
3 v3   p1 20393 76990 unkn  3.775315
4 v4   p1   112  1100 unkn  9.821429
5 v3   p1  6164 20260 unkn  3.286827
6 v5   p2   104  1155 unkn 11.105769
> # top expensive and cheap products
>
> attach(sales)
> # find median price of each product
> upp <- aggregate(Uprice, list(Prod), median, na.rm=TRUE)
> # alternate T,F for ordering up or down
> # get top 5 for each order
> topP <- sapply(c(T,F), function(o)
+               upp[order(upp[,2], decreasing=o)[1:5], 1])
> colnames(topP) <- c('Expensive', 'Cheap')
> # most expensive and cheapest products by median unit price
> topP
     Expensive Cheap
[1,] "p3689"   "p560"
[2,] "p2453"   "p559"
[3,] "p2452"   "p4195"
[4,] "p2456"   "p601"
[5,] "p2459"   "p563"
>
> # confirm expensive and cheap products with boxplots
> # use %in% to select rows whose Prod is in the first row of topP
>
> tops <- sales[Prod %in% topP[1,], c("Prod", "Uprice")]
> # drop unused factor levels, keeping the 2 products
> tops$Prod <- factor(tops$Prod)
> # uses log scale
> boxplot(Uprice ~ Prod, data=tops, ylab="Uprice", log="y")
> # see big discrepancy in highest and lowest products
>
> # which salespeople bring in most and least?
> # sum by unique salespeople
> vsales <- aggregate(Val, list(ID), sum, na.rm=TRUE)
> # take top 5 and lowest 5
> scoresSs <- sapply(c(T,F), function(o)
+                   vsales[order(vsales$x, decreasing=o)[1:5], 1])
> colnames(scoresSs) <- c("Most", "Least")
> # top and lowest 5 salespeople by value
> scoresSs
     Most    Least
[1,] "v431"  "v3355"
[2,] "v54"   "v6069"
[3,] "v19"   "v5876"
[4,] "v4520" "v6058"
[5,] "v955"  "v4515"
>
> # boxplot of best and worst salespeople
> valsales <- sales[ID %in% scoresSs[1,], c("ID", "Val")]
> valsales$ID <- factor(valsales$ID)
> boxplot(Val ~ ID, data=valsales, ylab="Transaction Value", log="y")
By calling "head()" on the "sales" data, we see the first six transactions, with the new "Uprice" variable added as the last column. Next we "attach()" the dataset, which adds it to R's search path so we can reference the columns directly by name.
Unit-Price of Products
We tackle the typical unit-price of each product first. Using the "aggregate()" function, we apply "median()" to the unit-prices within each product level. Remember to remove the missing values by setting "na.rm=TRUE". Next we use "sapply()" to pass TRUE and FALSE to the "decreasing" argument of "order()", applied to the median unit-prices: the first 5 in decreasing order are the highest-priced products, and the first 5 in increasing order are the lowest-priced. From the product IDs, the most expensive product by median unit-price was "p3689", and the least expensive was "p560".
Next we create a boxplot, shown below, of those two extreme products, using all of their unit-price transactions. "p3689" eclipses "p560" by a large margin, especially considering the scale is logarithmic.
Figure 3. Cheapest and Most Expensive Product by Unit-Price
Salespeople's Sale Values
Next in the R code, we turn to the value of sales by salesperson. We take "Val" and sum it by "ID" with the "aggregate()" function, again removing the missing values. As in the unit-price calculation, we use "sapply()" to obtain the 5 highest and 5 lowest totals, identifying the best and worst performing salespeople. Looking at "scoresSs", "v431" performed the best and "v3355" the worst. We see this relationship in the boxplot below:
Figure 4. Salesperson with Highest and Lowest Sales
A clear portion of the top salesperson's transactions are high-value outliers. The majority of "v431's" values were so high compared to "v3355's" that "v3355's" transactions appear as a meager line on the log scale.
Sale Values Code:
> # diving into sales
> # the top 100 salespeople account for 40% of income
>
> sum(vsales[order(vsales$x, decreasing=T)[1:100], 2])/sum(Val, na.rm=T)*100
[1] 38.33277
>
> # while bottom 2k brought in less than 2% of income
>
> sum(vsales[order(vsales$x, decreasing=F)[1:2000], 2])/sum(Val, na.rm=T)*100
[1] 1.988716
Investigating the sales values further, we discover from the above R code that the top 100 salespeople accounted for nearly 40% of total sales value, while the lowest 2,000 brought in barely 2%. The discrepancy between the high and low sales performers is stark.
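Since we will repeat this share calculation for products shortly, here is a small convenience function (my own sketch, not from the original post) that generalizes it for any two-column "aggregate()" result like "vsales":

> # sketch: percentage share of the total brought in by the top/bottom n groups
> share <- function(agg, n, decreasing=TRUE) {
+   sum(agg[order(agg$x, decreasing=decreasing)[1:n], 2]) /
+     sum(agg$x) * 100
+ }
> share(vsales, 100)          # top 100 salespeople, ~38.3%
> share(vsales, 2000, FALSE)  # bottom 2,000 salespeople, ~2.0%

Because "vsales" was built with "na.rm=TRUE", summing its "x" column gives the same denominator as "sum(Val, na.rm=T)" in the code above.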
Product Quantity
Now let's turn to the quantity of product sold. Using the same code pattern as above, the "aggregate()" function returns the total quantity, grouped by product. Then "sapply()" comes into play again, giving us the 5 products with the highest and lowest total quantities. In this dataset, salespeople sold product "p2516" the most, and "p2442" the least.
Quantity of Product Code:
> # for quantity sold for each product, more unbalanced
>
> qs <- aggregate(Quant, list(Prod), sum, na.rm=T)
> scoresPs <- sapply(c(T,F), function(o)
+                   qs[order(qs$x, decreasing=o)[1:5], 1])
> colnames(scoresPs) <- c('most', 'least')
> scoresPs
     most    least
[1,] "p2516" "p2442"
[2,] "p3599" "p2443"
[3,] "p314"  "p1653"
[4,] "p569"  "p4101"
[5,] "p319"  "p3678"
>
> # create boxplot of low-high quantity products
> qtop <- sales[Prod %in% scoresPs[1,], c("Prod", "Quant")]
> qtop$Prod <- factor(qtop$Prod)
> boxplot(Quant ~ Prod, data=qtop, ylab="Transaction Quantity",
+         log="y")
For product "p2442", we can barely see the boxplot next to the range of "p2516", despite the log scale. This clearly shows the disparity between the quantities of product sold. Just look at the y-axis range: 1e+02 to 1e+08!
Figure 5. Lowest and Highest Quantity of Product
When we took the top 100 salespeople, they accounted for nearly 40% of the value generated. For the quantity of product sold, the disparity is even starker: the top 100 products make up about 75% of the total quantity, while the lowest 4,000 products share barely 9%.
More Quantity Code:
> # of 4548 products
> # top 100 products about 75% of quantity
> sum(as.double(qs[order(qs$x, decreasing=T)[1:100], 2]))/
+   sum(as.double(Quant), na.rm=T) * 100
[1] 74.63478
>
> # lowest 4k products less than 9% of quantity
>
> sum(as.double(qs[order(qs$x, decreasing=F)[1:4000], 2]))/
+   sum(as.double(Quant), na.rm=T) * 100
[1] 8.944681
> # cannot draw conclusions about profit margins so lower
> # quantity might not be a bad thing
Even if the quantity is low, the salesperson could have sold the product at a high value, for a healthy profit margin. So, while we examine one variable at a time, low quantity does not necessarily mean low value.
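One way to probe that caveat (a hypothetical cross-check of my own, not in the original post) is to ask what share of total sales value the 4,000 lowest-quantity products bring in; a non-trivial share would confirm that low quantity can still mean real revenue:

> # my own sketch: value share of the 4,000 lowest-quantity products,
> # reusing qs (quantity per product) from the code above
> pv <- aggregate(Val, list(Prod), sum, na.rm=TRUE)
> lowQ <- qs[order(qs$x, decreasing=FALSE)[1:4000], 1]
> sum(pv[pv$Group.1 %in% lowQ, "x"]) / sum(Val, na.rm=TRUE) * 100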
Outliers
When we look up at the stars, we marvel at how far away they are. Some values in this dataset sit as far from the bulk of the data as the stars are from us. Looking at boxplot outliers is a good starting point. Specifically, the whiskers extend 1.5 times the interquartile range (IQR) below the first quartile and above the third quartile, and values falling outside them give us a measure of those outliers.
Unit-Price Outliers Code:
> # the unit price should follow a near-normal distribution
>
> # number of outliers from boxplot whiskers
> out <- tapply(Uprice, list(Prod=Prod),
+              function(x) length(boxplot.stats(x)$out))
>
> # top 10 products which have outlying values
>
> out[order(out, decreasing=T)[1:10]]
Prod
p1125 p1437 p2273 p1917 p1918 p4089  p538 p3774 p2742 p3338
  376   181   165   156   156   137   129   125   120   117
> # percentage of total outlying values
> sum(out)/nrow(sales)*100
[1] 7.34047
> # technically we should only inspect the OK data, but it is only
> # 3.6% of the total data, and we do not know the true values of
> # the fraudulent data: risk will always exist
Using "tapply()" to perform a function by groups, we take the number of outliers from each unique product. Looking at the top 10 products with outliers, "p1125" has 376, "p1437" has 181, as so on. When we divide total outliers by the number of transactions, we find that the outliers only consist of 7.3% of the data.
Now we perform the same analysis for the quantity of product:
Quantity Outliers Code:
> # outliers for quantity of product
> q.out <- tapply(Quant, list(Prod=Prod),
+                function(x) length(boxplot.stats(x)$out))
>
> # top 10 quantity outliers
> q.out[order(q.out, decreasing=T)[1:10]]
Prod
p1125 p3774 p2742 p4089 p1917 p1437 p2273 p1938 p1918 p2957
  541   244   234   217   216   211   199   197   175   163
We see that the product with the most quantity outliers is "p1125", the same culprit that topped the unit-price outliers. Keeping product "p1125" in mind for further cleaning and analysis, we now proceed to handling the missing data.
Summary
OK folks, I know, that's quite a lot of information to process at once, so I will stop here. In Part 1, we examined our SALES dataset, teasing apart the distributions of the product transactions, values, and quantities. Doing so, we discovered that the values and quantities are heavily skewed, along with the number of transactions per salesperson. Some products and salespeople bring in the majority of the value and transactions. However, the data is not ready for model creation, as certain values are missing. Transactions with one missing value we can impute, but some are missing both "Val" and "Quant". So Part 2 will deal with what we data scientists 'love' the most: data cleaning and missing data. So stay tuned!
Thanks for reading,
Wayne
@beyondvalence
Fraudulent Transactions Series:
1. Predicting Fraudulent Transactions in R: Part 1. Transactions
2. Predicting Fraudulent Transactions in R: Part 2. Handling Missing Data
3. Predicting Fraudulent Transactions in R: Part 3. Handling Transaction Outliers
4. Predicting Fraudulent Transactions in R: Part 4. Model Criterion, Precision & Recall
5. Predicting Fraudulent Transactions in R: Part 5. Normalized Distance to Typical Price