Pages

Thursday, July 24, 2014

Predicting Fraudulent Transactions in R: Part 1. Transactions


Hello Readers,

Today we start a new Case Study Series to audit fraudulent sales transactions. What makes a transaction fraudulent, you might ask? Looking from a statistical standpoint, transactions, or observations would be considered fraudulent if they were unusual in nature, such that they deviate from the norm or arouse suspicion. Various financial instruments such as bank accounts or credit cards can be analyzed for fraudulent transactions.


Fishy Numbers
Our transaction dataset contains information concerning the sales and products from salespersonnel from a certain company. The identities have been wiped from the data to preserve anonymity. The salespeople set their own selling prices and report the sales back to the company at the end of each month. A minor percentage of these records were audited and labeled as "ok" or "fraud". The other records depend on us, the data scientists, for fraud discovery. This situation arises frequently in various real companies, where only a few records can be inspected individually among the many thousands or even millions of records, because of budget and time constraints. So we will build a model from the set of transactions which have been audited, and predict the audited result for the remaining transactions. That way we can automate the auditing and catch more fraudulent transactions, thus saving the company's costs.

The data is available from the "DMwR" package in R. Load it, prepare yourself for fishy numbers, and let's get started.

(This is a series from Luis Torgo's  Data Mining with R book.)


Numerous Transactions


You might have guessed that there are many records in this data set. And you would be correct, and thankfully not millions. Start by loading the library "DMwR" and the subsequent dataset "sales". Then we call the basic "head()" and "summary()" functions to understand the data content.

Data Loading and Preview Code:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
> # loading the library
> library(DMwR)
Loading required package: lattice
Loading required package: grid
KernSmooth 2.23 loaded
Copyright M. P. Wand 1997-2009
> 
> # loading the sales dataset
> data(sales)
> 
> # initial exploration
> head(sales)
  ID Prod Quant   Val Insp
1 v1   p1   182  1665 unkn
2 v2   p1  3072  8780 unkn
3 v3   p1 20393 76990 unkn
4 v4   p1   112  1100 unkn
5 v3   p1  6164 20260 unkn
6 v5   p2   104  1155 unkn
> 
> dim(sales)
[1] 401146      5
> 
> summary(sales)
       ID              Prod            Quant                Val         
 v431   : 10159   p1125  :  3923   Min.   :      100   Min.   :   1005  
 v54    :  6017   p3774  :  1824   1st Qu.:      107   1st Qu.:   1345  
 v426   :  3902   p1437  :  1720   Median :      168   Median :   2675  
 v1679  :  3016   p1917  :  1702   Mean   :     8442   Mean   :  14617  
 v1085  :  3001   p4089  :  1598   3rd Qu.:      738   3rd Qu.:   8680  
 v1183  :  2642   p2742  :  1519   Max.   :473883883   Max.   :4642955  
 (Other):372409   (Other):388860   NA's   :13842       NA's   :1182     
    Insp       
 ok   : 14462  
 unkn :385414  
 fraud:  1270  
> 
> # to view levels of sales people and products
> nlevels(sales$ID)
[1] 6016
> nlevels(sales$Prod)
[1] 4548
> 

We see from the output we have 401,146 transactions, with 5 variable columns: "ID", "Prod", "Quant", "Val", and "Insp". The salesperson ID is numbered with a prefix "v" while the product code is prefixed with a "p". Because a salesperson can have multiple transactions with several different products, and products can be sold by different salespeople, we require unique elements from "ID" and "Prod" to gauge the number of salespeople and products available. Using the "nlevels()" function we discover 6,016 salespeople and 4,548 products in the dataset.


From the summary output, we see some missing values in the "Quant" and "Val" columns, in addition to some ridiculous quantity and value figures. We will have to address the sale of 473,883,883 units of product, and the sale of $4,642,955 worth of product. Perhaps a costly typo might explain those large figures, and we deal will with them in the next post, where we deal with missing values. 

If any quantity or value metric is missing in a transaction, it would be difficult to verify that the particular transaction is not fraudulent: we would not be able to tell how many and/or how much a product sold for. A single product could be sold at too low of a price, or many products could be sold for next to nothing, resulting in (fraudulent) losses for the company. So we need to investigate those missing values further.

Tracking Missing Entries Code:
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
> # large number of missing values for quant and value
> length(which(is.na(sales$Quant) & is.na(sales$Val)))
[1] 888
> # or sum(is.na(sales$Quant) & is.na(sales$Val))
> # proportion of sales
> table(sales$Insp)/nrow(sales) * 100

       ok      unkn     fraud 
 3.605171 96.078236  0.316593 
> 
> table(sales$Insp)

    ok   unkn  fraud 
 14462 385414   1270 
> 

Well 888 entries have both the quantity AND values missing! What were those salespeople doing when they filed their reports? Because we cannot confidently impute average product values for those transactions, we hope that they are not designated "ok". In a later post, we will investigate and deal with those transactions with the missing elements, either in "Quant", "Val", or in both.



Tracking Transactions


To achieve further understanding of the sales data, we look at the distribution of transactions per sales person and per product. By doing this graphically, we can see the salespeople and product activity from the available transactions.

Plotting Transactions Code:
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
> ## plotting transactions 
> 
> # reports per sales person
> totS <-table(sales$ID)
>  
> barplot(totS, main="Transactions per salespeople", names.arg="",
+ xlab="Salespeople", ylab="Amount")
> 
> # reports per product
> totP <- table(sales$Prod)
> barplot(totP, main="Transactions per product", names.arg="",
+ xlab="Products", ylab="Amount", ylim=c(0,4000))
> #
> # plots show much variability in transactions
> 

We begin with the salespeople. From the below plot, we see clear evidence of outstanding performers, with some reaching over 10,000 and 6,000 transactions! Suspicious? Maybe. Now from the graph, we understand that most of the salespeople numbered below 1,000 transactions, so we will keep an eye on the prolific sellers of products. Keep note however, that this is just the number of transactions, not the quantity or value of the product sold. Just because a salesperson has more transactions, does not mean he or she brought in more sales revenue.


Figure 1. Transactions per Salespeople
Now we turn to the products. Some products evidently were involved in more transactions that others, with the highest being sold in roughly 4,000 transactions. Another similar caveat- even though that particular product A was sold over 4,000 times, another product B could have been sold in only 400 transactions, but with quantities averaging 1,000 per order. So the product A in the most transactions might not have been the actual best seller, since the plot below does not depict the quantity per transaction.


Figure 2. Transactions per Product


Generating Unit-Price


Because salespeople set the price, and we saw the suspicious maximum values for quantity and value from the "summary()" output: 473,883,883 units of product, and a product that was sold for a total of $4,642,955, we must use another metric compare the product prices. Thus we use the unit-price, where we take the "Val" and divide it by "Quant" to get the price paid per unit for that particular product in that transaction:

Unit-Price, Product Prices, & Salespeople Sales Code:
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
> # create unit price to see difference in prices
> #
> sales$Uprice <- sales$Val/sales$Quant
> head(sales)
  ID Prod Quant   Val Insp    Uprice
1 v1   p1   182  1665 unkn  9.148352
2 v2   p1  3072  8780 unkn  2.858073
3 v3   p1 20393 76990 unkn  3.775315
4 v4   p1   112  1100 unkn  9.821429
5 v3   p1  6164 20260 unkn  3.286827
6 v5   p2   104  1155 unkn 11.105769
> # top expensive and cheap products
> #
> attach(sales)
> # find median price of each product
> upp <- aggregate(Uprice, list(Prod), median, na.rm=TRUE)
> # alternate T,F for ordering up or down
> # get top 5 for each order
> topP <- sapply(c(T,F), function(o)
+                upp[order(upp[,2],decreasing=o)[1:5],1])
> colnames(topP) <- c('Expensive', 'Cheap')
> # most expensive and cheapest products by median unit price
> topP
     Expensive Cheap  
[1,] "p3689"   "p560" 
[2,] "p2453"   "p559" 
[3,] "p2452"   "p4195"
[4,] "p2456"   "p601" 
[5,] "p2459"   "p563" 
> #
> # confirm expensive and cheap products with boxplots
> # use %in% belong in a set of first row in topP
> #
> tops <- sales[Prod %in% topP[1,], c("Prod", "Uprice")]
> # change to 2 levels, from
> tops$Prod <- factor(tops$Prod)
> # uses log scale
> boxplot(Uprice ~ Prod, data=tops, ylab="Uprice", log="y")
> # see big discrepancy in highest and lowest products
> #
> #
> # which salespeople bring in most and least?
> # sum by unique salespeople
> vsales <- aggregate(Val, list(ID), sum, na.rm=TRUE)
> # take top 5 and lowest 5
> scoresSs <- sapply(c(T,F), function(o)
+                    vsales[order(vsales$x, decreasing=o)[1:5],1])
> colnames(scoresSs) <- c("Most", "Least")
> # top and lowest 5 salespeople by value
> scoresSs
     Most    Least  
[1,] "v431"  "v3355"
[2,] "v54"   "v6069"
[3,] "v19"   "v5876"
[4,] "v4520" "v6058"
[5,] "v955"  "v4515"
> 
> # boxplot of best and worst salespeople
> valsales <- sales[ID %in% scoresSs[1,], c("ID", "Val")]
> valsales$ID <- factor(valsales$ID)
> boxplot(Val ~ ID, data=valsales, ylab="Transaction Value", log="y")

By calling "head()" of the "sales" data, we see the first six transactions, and the "Uprice" variable was added to the end of the dataset. Next we "attach()" the dataset so that R adds the dataset to the search path, so we can call the column objects directly.



Unit-Price of Products


We tackle the average unit-price of the various products first. Using the "aggregate()" function, we can apply the "median()" function to find the median unit-price by each product level. Remember to remove the missing values by setting "na.rm=TRUE". Next we use the "sapply()" function to pass TRUE and FALSE arguments to the "order()" function acting on the median product unit-price output. The function takes the first 5 in decreasing order, highest prices, and first 5 in increasing order, so lowest prices. Then we can see the product IDs, the most expensive product per unit-price was "p3689", and the least expensive was "p560".

Next we create a boxplot, shown below, of the those two extreme products, utilizing all of their unit-price transactions. "p3689" eclipses "p560", and by a large margin since the scale is in log.


Figure 3. Cheapest and Most Expensive Product by Unit-Price


Salespeople's Sale Values



Subsequently, in the R code, we turn to the value of sales by salespeople. Take "Val" and sum by "ID" with the "aggregate()" function, while removing the missing values. Similar to the previous unit-price calculation, we use "sapply()" again to obtain the 5 highest and lowest sales to determine the best and worst performing salespeople. Looking at "scoresSs", "v431" performed the best, and "v3355" performed the worst. We see this relationship in the boxplot below:


Figure 4. Salesperson with Highest and Lowest Sales

There is a clear portion of the high value salesperson's transactions that are high value outliers. The majority of "v431's" values were so high compared to "v3355", that "v3355's" transactions appear as a meager line on the log scale.


Sale Values Code:
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
> # diving into sales
> # the top 100 salespeople account for 40% of income
> #
> sum(vsales[order(vsales$x, decreasing=T)[1:100], 2])/sum(Val, na.rm=T)*100
[1] 38.33277
> #
> # while bottom 2k brought in less than 2% of income
> #
> sum(vsales[order(vsales$x, decreasing=F)[1:2000], 2])/sum(Val, na.rm=T)*100
[1] 1.988716

Investigating the value of sales further, we discover from the above R code that the top 100 salespeople accounted for nearly 40% of the total sales value. There is a high discrepancy between the high and low sales performers, as the lowest 2,000 summed to barely 2% of the total sales value.



Product Quantity



Now let's turn to the quantity of product per transaction. Using a similar code format from above, the "aggregate()" function returns the quantity of products grouped by product. Then "sapply()" comes into play again, giving us the highest and lowest 5 quantities for each product. In this dataset, salespeople sold product "p2516" the most, and "p2442" the least.

Quantity of Product Code:
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
> # for quantity sold for each product, more unbalanced
> #
> qs <- aggregate(Quant, list(Prod), sum, na.rm=T)
> scoresPs <- sapply(c(T,F), function(o)
+                    qs[order(qs$x, decreasing=o)[1:5], 1])
> colnames(scoresPs) <- c('most', 'least')
> scoresPs
     most    least  
[1,] "p2516" "p2442"
[2,] "p3599" "p2443"
[3,] "p314"  "p1653"
[4,] "p569"  "p4101"
[5,] "p319"  "p3678"
> 
> # create boxplot of low-high quantity products
> qtop <- sales[Prod %in% scoresPs[1,], c("Prod", "Quant")]
> qtop$Prod <- factor(qtop$Prod)
> boxplot(Quant ~ Prod, data=qtop, ylab=c("Transaction Quantity"),
+         log="y")
> 

For product "p2442", we do not even see the boxplot compared to the boxplot range of "p2516", despite being in log. This clearly shows the disparity between the quantities of product sold. Just look at the y-axis range: 1e+02 to 1e+08!



Figure 5. Lowest and Highest Quantity of Product

When we took the top 100 salespeople, they accounted for nearly 40% of the value generated. But for the quantity of product sold, again, the disparity is stark. For the 100 products sold, they consist of 75% of the total quantity. And for the lowest 4,000 products, they barely share 9% of the quantity.


More Quantity Code:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
> # of 4548 products
> # top 100 products about 75% of quantity
> sum(as.double(qs[order(qs$x, decreasing=T)[1:100], 2]))/
+ sum(as.double(Quant), na.rm=T) * 100
[1] 74.63478
> #
> # lowest 4k products less than 9% of quantity
> #
> sum(as.double(qs[order(qs$x, decreasing=F)[1:4000], 2]))/
+ sum(as.double(Quant), na.rm=T) * 100
[1] 8.944681
> # cannot draw conclusions about profit margins so lower
> # quantity might not be a bad thing

Even if the quantity is low, the salesperson could have sold it at a high value for a great profit margin. So low quantity does not necessarily mean low value, while we look at one variable at a time.



Outliers


When we look up at the stars, we marvel how far away they are. Similar values in this dataset appear to be as far away from the bulk of the data, as the stars are to us. Looking at boxplot outliers is a good starting point. Specifically, looking at values outside the whiskers, which are 1.5+/- the inner quartile range (IQR) will give us a measure of those outliers.


Unit-Price Outliers Code:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
> # the unit price should follow a near-normal distribution
> #
> # number of outliers from boxplot whiskers
> out <- tapply(Uprice, list(Prod=Prod),
+               function(x) length(boxplot.stats(x)$out))
> #
> # top 10 products which have outlying values
> #
> out[order(out, decreasing=T)[1:10]]
Prod
p1125 p1437 p2273 p1917 p1918 p4089  p538 p3774 p2742 p3338 
  376   181   165   156   156   137   129   125   120   117 
> # percentage of total outlying values
> sum(out)/nrow(sales)*100
[1] 7.34047
> # technically we should only inspect the OK data, but it is only
> # 3.6% of the total data, and we do not know the true values of
> # the fradulent data: risk will always exist
> 

Using "tapply()" to perform a function by groups, we take the number of outliers from each unique product. Looking at the top 10 products with outliers, "p1125" has 376, "p1437" has 181, as so on. When we divide total outliers by the number of transactions, we find that the outliers only consist of 7.3% of the data.


Now we perform the same analysis for the quantity of product:


Quantity Outliers Code:
1
2
3
4
5
6
7
8
9
> # outliers for quantity of product
> q.out <- tapply(Quant, list(Prod=Prod),
+ function(x) length(boxplot.stats(x)$out))
> 
> # top 10 quantity outiers
> q.out[order(q.out, decreasing=T)[1:10]]
Prod
p1125 p3774 p2742 p4089 p1917 p1437 p2273 p1938 p1918 p2957 
  541   244   234   217   216   211   199   197   175   163 

We see the product with the most quantity outliers is "p1125", the same culprit as the highest unit-price outlier. Keeping product "p1125" in mind for further cleaning and analysis, we now proceed to handling the missing data.



Summary


OK folks, I know, that's quite a lot of information to process at once, so I will stop here. In Part 1, we examined our SALES dataset, teasing apart the distributions of the product transactions, values and quantities. Doing so, we discovered that the value and quantities are heavily skewed, along with the number of transactions per salespeople. Some products and salespeople bring in the majority of the value and transactions. However the data is not ready for model creation, as there are certain values which are missing. Transactions with one missing value we can impute, but some have both the "Val" and "Quant" missing. So Part 2 will deal with what we data scientist 'love' the most: data cleaning,  and missing data. So stay tuned!


Thanks for reading,

Wayne
@beyondvalence
LinkedIn


Fraudulent Transactions Series:
1. Predicting Fraudulent Transactions in R: Part 1. Transactions
2. Predicting Fraudulent Transactions in R: Part 2. Handling Missing Data
3. Predicting Fraudulent Transactions in R: Part 3. Handling Transaction Outliers
4. Predicting Fraudulent Transactions in R: Part 4. Model Criterion, Precision & Recall
5. Predicting Fraudulent Transactions in R: Part 5. Normalized Distance to Typical Price
.

6 comments:

  1. Really cool post, highly informative and professionally written and I am glad to be a visitor of this perfect blog, thank you for this rare info!

    Data Science Online Training|
    R Programming Online Training|
    Hadoop Online Training

    ReplyDelete
  2. Thanks for your informative blog!!! Keep on updating your with such awesome information.

    Data Science Online Training|
    R Programming Online Training|
    Hadoop Online Training

    ReplyDelete
  3. Thanks for your informative blog!!! Keep on updating your with such awesome information.

    Data Science Online Training|
    R Programming Online Training|
    Hadoop Online Training

    ReplyDelete
  4. Thank you so much for your best effort.

    ReplyDelete
  5. Great blog.you put Good stuff.All the topics were explained briefly.so quickly understand for me.I am waiting for your next fantastic blog.Thanks for sharing.Any coures related details learn..

    Hadoop Training in Marathahalli|
    Hadoop Training in Bangalore|
    Data science training in Marathahalli|
    Data science training in Bangalore|

    ReplyDelete