Pages

Sunday, June 8, 2014

KDD Cup: Profit Optimization in R Part 5: Evaluation


Hello Readers,

Today we wrap up this case study on profit optimization using the data from the Knowledge Discovery and Data mining competition. In Part 4 we selected the best decision tree parameters to predict donations. Here in Part 5 we will take the best of the 9 trees created with the parameters in Part 2, and use the new data in the evaluation file. Lastly, to verify the predicted donations, we will compare them with the evaluation target text file (download here).

Let us start R, and begin data crunching!


Evaluation Data


So far we have been using the data from "cup98LRN.txt" to train various decision tree models. After we selected the best parameters for the optimal tree, we can use that tree object to predict donations (the "TARGET_D" variable) from "cup98VAL.txt". Since it is in CSV format, use
read.csv() to read the file into R. Remember to set the working directory to where the files are located.

The "varSet2" is the same variable set used in training the decision trees in Part 2. We then load the "cup98" data to compare existing variables, and we see that our training data includes "TARGET_D" and the test "cup98VAL" data does not include it. Makes sense, since we aim to predict that variable with our tree model.


Loading Evaluation Data Code:
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
> library(party)
> # training variable names
> varSet2 <- c("AGE", "AVGGIFT", "CARDGIFT", "CARDPM12",
+              "CARDPROM", "CLUSTER2", "DOMAIN", "GENDER", "GEOCODE2", "HIT",
+              "HOMEOWNR", "HPHONE_D", "INCOME", "LASTGIFT", "MAXRAMNT",
+              "MDMAUD_F", "MDMAUD_R", "MINRAMNT", "NGIFTALL", "NUMPRM12",
+              "PCOWNERS", "PEPSTRFL", "PETS", "RAMNTALL", "RECINHSE",
+              "RFA_2A", "RFA_2F", "STATE", "TIMELAG")
> 
> # load training data
> load("cup98var2.rdata")
> 
> # 5. scoring validation data ####
> # use test dataset cup98VAL.txt
> cup98val <- read.csv("cup98VAL.txt")
> cup98val <- cup98val[, c("CONTROLN", varSet2)]
>
> trainNames <- names(cup98)
> scoreNames <- names(cup98val)
>
> # check if any variables not in scoring data
> idx <- which(trainNames %in% scoreNames)
> print(trainNames[-idx]) # TARGET_D not in scoreNames
[1] "TARGET_D"

But we are not finished checking the evaluation data. We need to verify that the evaluation data has same factor levels as the factor levels in the training "cup98" data. Since the tree model is trained on the training levels, it cannot evaluate the new levels. Take, for example,  a fictitious categorical variable named "incomeLevel" with three levels at "low", "middle", and "high". If in the test data the "incomeLevel" variable had an additional level, "upperMiddle", then the model can not take into account the extra level. So if there are any extra factor levels in the new data, then we must catch them, change them to NA. 


Using only the variables in common, we loop through them to evaluate the levels by variable. If the variable is a factor (is.factor())and the levels are not equal (setequal()), then we will print out such, and set the levels to the levels in the training data.

Setting Factor Levels Code:
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
> # check and set levels in factors in scoring data
> scoreData <- cup98val
> vars <- intersect(trainNames, scoreNames)
> for(i in 1:length(vars)) {
+   varname <- vars[i]
+   trainLevels <- levels(cup98[, varname])
+   scoreLevels <- levels(scoreData[, varname])
+   if(is.factor(cup98[,varname]) & 
+        setequal(trainLevels, scoreLevels)==FALSE) 
+     {
+     cat("Warning: new values found in score data, 
+         and they will be changed to NA!\n\n")
+     cat("old ", varname, " levels: ", "\n")
+     cat(scoreLevels, "\n\n")
+     scoreData[, varname] <- factor(scoreData[, varname],
+                                    levels=trainLevels)
+     cat("new ", varname, " levels: ", "\n")
+     cat(levels(scoreData[, varname]),"\n\n")
+   }
+ }
Warning: new values found in score data, 
        and they will be changed to NA!

old  GENDER  levels:  
  F J M U 

new  GENDER  levels:  
  A C F J M U 

Warning: new values found in score data, 
        and they will be changed to NA!

old  STATE  levels:  
AA AE AK AL AP AR AS AZ CA CO CT DC DE FL GA GU HI IA ID IL IN KS KY LA MA MD ME MI MN MO MS MT NC ND NE NH NJ NM NV NY OH OK OR PA PR PW RI SC SD TN TX UT VA VI VT WA WI WV WY 

new  STATE  levels:  
AA AE AK AL AP AR AS AZ CA CO CT DC DE FL GA GU HI IA ID IL IN KS KY LA MA MD ME MI MN MO MS MT NC ND NE NH NJ NM NV NY OH OK OR PA RI SC SD TN TX UT VA VI VT WA WI WV WY 
> 

Inspecting the results, the loop caught 2 variables in the test data which had different factor levels than the training data: "GENDER", and "STATE". For "GENDER", the old levels did not include "A" or "C" levels, so they were appended into the test "GENDER" levels. Since the loop did not see any "A" or "C" levels, in the new data, they would not be used, even though they are possible levels. Looking at the "STATE" variable, we see that the loop cut two levels from the test data: "PR", and "PW". They could have been typos, or "PR" might have represented Puerto Rico. Can you think of a "PW" territory? I could not either. Anyways, the model could not have evaluated them anyways, since "STATE" did not include the two levels when we ran the tree model.


Now we are ready to predict the donations: "TARGET_D".


Predicting Donations


When we ran the model loop we divided the training data into 'training' and 'test' partitions to measure how well the tree fit the training data, because we could compare the predicted target variable from 'test' data to the actual target variable. Now after we selected the best decision tree in Part 4, we will load the tree object here to evaluate the true evaluation test data, "cup98VAL.txt".

We will use the "1000-400-4-10" parameters, run number 4 tree. Make sure you have loaded the party library prior to passing the tree object through the predict() function. Remember that the "scoreData" object points to the corrected factor levels test data, and we create a data.frame called "result" to include the prediction and the identification number, "CONTROLN".

Predicting from Evaluation Data Code:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
> # after evaluation data is prepared, create predictions
> # load selected tree model
> load("cup98-ctree-1000-400-4-10-run-4.rdata")
> # predicting
> pred <- predict(cup.Ctree, newdata=scoreData)
> pred <- round(pred, digits=3)
> 
> table(pred, useNA="ifany")
pred
0.545 0.585 0.591 0.639 0.667 0.741 0.869 0.963 0.984 1.062 1.182 1.307 
11051 32563  2539  7228  7735 10729  6349  4251  2536  1656  1695  1448 
1.453 1.573 1.676 1.795 1.821 2.051 3.349 
  867  1253   555   651   585  1996   680 
>
> # barplot of prediction
> barplot(table(pred), las=2, ylim=c(0,35000),
+         main="Number of Amount of Donations",
+         cex.names=0.8, cex.axis=0.8,
+         xlab="Donation Amount",
+         ylab="Number of Donations")
> grid(nx=NA,ny=14)
>
> result <- data.frame(scoreData$CONTROLN, pred)
> names(result) <- c("CONTROLN", "pred")
> str(result)
'data.frame': 96367 obs. of  2 variables:
 $ CONTROLN: int  188946 126296 155244 123985 119118 10120 59465 80803 2824 145014 ...
 $ pred    : num  1.062 0.585 0.585 0.545 0.667 ...
> 

After we predict the target values, and round the results, we take a look at them in a table. But better yet, look at them in a barplot:




The barplot tells us a few things about the distribution of the prediction values. Firstly, the range of the donation predictions ($) were from 0.545 to 3.349, with peaks at 0.585 and 0.741. Recall that the distributions of donations in the training data, was also skewed to the right, but ranged from $1 to $200 (below). The majority of those donations were below $27.




So the model did not predict the donations perfectly, as we did not encounter similar high predicted donations as the high values in the training data were outliers. It is possible that there were no donations in the evaluation data which were higher than $50. However we saw a pattern of training donations occurring in multiples of five so predicted values between $1 and $5 seem unlikely.


Nevertheless, we still have our predicted values, and can calculate the donation profit by subtracting the cost of 68 cents.



<black sheep wall>


Luckily for us, we have access to the true donation values in the "valtargt.txt" file from the DATA section of the 1998 KDD Cup. It has the "TARGET_B" and "TARGET_D" variables identified by "CONTROLN". So using the code below, we load the CSV file and
merge() the true values with the predicted results. Again, make sure the file is in your working directory.

Why black sheep wall? Because the "valtargt.txt" enables us to see the true donation values for the validation file, and the black sheep wall cheat code disabled the fog of war in Starcraft.

True Donations Code:
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
> # load evaluation targets
> valTarget <- read.csv("valtargt.txt")
> # nrows=96,367
>
> # create merged data.frame with actual and results ####
> merged <- merge(result, valTarget, by="CONTROLN")
> merged <- merged[order(merged$pred, decreasing=TRUE),]

> # plot regression pred and target
> library(ggplot2)
> ggplot(merged, aes(pred, TARGET_D)) 
+ geom_point(colour="red",alpha=0.05)
>

We order the "merged" data.frame by decreasing prediction values, similar to what we did when creating the decision trees in Part 2. The actual donation values will be ordered by the predicted values, so we can observe how close the predicted values are to the actual donations. Using ggplot(), we plot a simple regression of the prediction "pred" actual donations "TARGET_D".


We notice the majority of donations fall below $50, while the majority of predicted donations fall below $2. In the test data, there was a high donation at $500, compared to the high of $200 in the training data. Now that we have seen the compactness of our predictions, we will calculate the donation profit based on the true donations, using sum().

Calculating Donation Profit Code:
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
> # donation profit if mail all people ####
> # nrows=96,367
> cost <- 0.68
> sum(valTarget$TARGET_D - cost)
[1] 10560.08
> 
> # donation profit if mail those predicted ####
> # to donate more than mail cost, nrows=35,250
> idx <- (merged$pred > cost)
> sum(merged$TARGET_D[idx] - cost)
[1] 11857.77
> 

The donation profit sums to $10,560.08. How about only mailing the orders to those predicted to donate more than the $0.68 cost? Creating an index of those predictions more than the cost, we can subset the donation values, and arrive at $11,857.77. By using the selective mailing technique, we increase the profit by $1,297.69.



Visualizing Donations


First we plot the cumulative donations if we mailed the donation solicitations to all the customers. We create a percentage scale for the x-axis and use the cumsum() to calculate the cumulative sum for the y-axis, while factoring in the cost. Again, like in Part 3, we devise an index with increments of 10 for R to plot more easily.

We index by 10 for the second plot as well, where we plot the percentage of optimal customers by their cumulative donations. Note that we optimized the customers by their predicted donation value while accounting for the cost, and that our predictions were not the best at predicting donations (see above regression plot). But it is better than mailing to everyone? For this model, yes, by $1,297.69 if we mailed it to 35,250 of the 96,367 total customers.

Creating Donation Visualizations Code:
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
> # plot ranking all customers ####
> x <- 100 * (1:nrow(merged)) / nrow(merged)
> y <- cumsum(merged$TARGET_D) - cost*(1:nrow(valTarget))
> # to reduce size of the file to save this chart
> # n=9,638
> idx.pos <- c(seq(1, length(x), by=10), length(x))
> plot(x[idx.pos], y[idx.pos], type="l", xlab="Contact Percentile (%)",
+      ylab="Amount of Donation", main="All Customers")
> grid()
> 
> # plot ranking only optimal customers ####
> x1 <- 100 * (1:nrow(merged[idx,])) / nrow(merged[idx,])
> y1 <- cumsum(merged$TARGET_D[idx]) - cost*(1:nrow(valTarget[idx,]))
> idx.pos1 <- c(seq(1, length(x1), by=10), length(x1))
> plot(x1[idx.pos1], y1[idx.pos1], type="l", xlab="Contact Percentile (%)",
+      ylab="Amount of Donation", main="Optimal Customers")
> grid()
> 
> # plot both customer group donations ####
> x3 <- (1:nrow(merged)/1000)
> plot(x3[idx.pos], y[idx.pos], type="l", col=1, ylim=c(0,13000),
+      xlab="Number Contacted, Thousands", ylab="Amount of Donation $", 
+      main="Cumulative Donations for Both Customer Groups")
> grid()
> par(new=TRUE)
> plot(x3[idx.pos1], y1[idx.pos1], type="l", col=2, 
+      xaxt="n", yaxt="n", xlab="", ylab="",
+      xlim=c(0,100), ylim=c(0,13000))
> legend("bottomright", col=c(1,2), lty=c(1,1),
+        legend=c("All Customers (n=96,367)", 
+                 "Profitable Customers Only (n=35,250)"))
> 

As a visual bonus, I include a third plot where we overlay both plots to showcase the cumulative donations accrued by the different number of customers mailed. Below is the first plot where we mailed to all the customers. Recall the donation profit for all customers is just above $10.5k. We see large donation increases in the first 20% of the customer mailings. The increases indicate that the predictions captured large true donation values, as they were ordered in decreasing order by the predictions. Additionally, we see the peak approaches $13k, but we mailed to all customers, so we incurred deficits whenever some of them did not donate. See the drop in donations in the 60%-80% customer contacts.




So taking into account not all customers donated, or donated more than the $0.68 cost, we proactively selected those predicted to donate above the 68 cents cost, shown below. Immediately we see differences in the cumulative plot curve compared to the plot above. Here we see a general increase in cumulative donation profit- indicating that the predicted values captured higher true donation values above the cost. With some drops or plateaus in cumulative donations, the predicted values also captured customers so did not donate sufficient amounts. Also possibly not all who donated sufficient amounts were captured in the predicted donations to show in the plot of true donations. However, notice this total donation position is higher than the total donation position in the all customers plot.




Now we turn to the plot with both cumulative donation lines coexist, below. Instead of using customer percentile in the x-axis, we use the customer number in thousands to compare the number of customers mailed. With the all customer plot a familiar sight, note the red line restricting donation amounts to customers who donated more than the cost. The red line mainly follows the all customers line, but begins to overtake the regular line in cumulative donations around 10k customers. It stops just under $12k after 30k customers. The black line includes all customers, regardless of how they donated, so it extends farther than the selective red line. Here you can compare the y-axis total donation end points for both customer groups, with the red line ending higher then the black line.





Notes


So how do we improve the model? There were two target variables- "TARGET_B" which indicated yes or no to donation status, and "TARGET_D" which described the amount of donation. A next step would be to create a two tier model to predict who would donate, "TARGET_B", and then of those who would donate, how much ("TARGET_D"). With our model revealing a donation profit of $11,857.77, what is the max donation profit using the evaluation data? How well did we predict the donations?

Max Donation Profit Code:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
> # optimal donation profit ####
> # those who donated more than cost, n=4,872
> merged <- merged[order(merged$TARGET_D, decreasing=TRUE),]
> idx <- (merged$TARGET_D > cost)
> sum(idx)
[1] 4872
> sum(merged$TARGET_D[idx] - cost)
[1] 72776.36
>
> # plotting cumulative donations
> plot(1:sum(idx), cumsum(merged$TARGET_D[idx]-cost),
+ type="l", ylim=c(0,80000),
+ ylab="Cumulative Donations ($)",
+ xlab="Customers Mailed",
+ main="Optimal Customer Selection",
+ sub="$72,776.36 Max Donation Profit")
> grid()

Out of the 96,367 customers in the evaluation data, selecting only those who donated more than 68 cents, we would mail donation requests to 4,872 customers for a profit of $72,778.36. Wow, such a large donation profit! Quickly we realize that our donation profit was so low because we failed to eliminate those who would donate insufficient amounts, thus mailing it to over 32,000 customers, instead of the optimal 4,872. Looking at the real values, our optimal model was not very optimal at all!




With perfect customer selection, donation ordering, and immaculate hindsight, we observe the cumulative donation plot above having no drops- therefore no deficits- in donations.


Well folks, thank you for following along this case study series on optimizing profits with decision trees using previous KDD data. Here is a link to the 2014 KDD conference in New York City. Stay tuned for more R posts, and case studies! Let me know what you would like to see on the Valence Analytics blog.


Thanks for reading,

Wayne
@beyondvalence
LinkedIn

More:
1. KDD Cup: Profit Optimization in R Part 1: Exploring Data
2. KDD Cup: Profit Optimization in R Part 2: Decision Trees
3. KDD Cup: Profit Optimization in R Part 3: Visualizing Results
4. KDD Cup: Profit Optimization in R Part 4: Selecting Trees
5. KDD Cup: Profit Optimization in R Part 5: Evaluation

5 comments:

  1. Hi Wayne, great set of articles. Thanks for sharing.
    It would be great to have a section about Feature Selection.

    ReplyDelete
  2. is there any way to see who are the optimal customers according to the predication tree?

    ReplyDelete
  3. hey,
    can you please tell me what will be the future scope of this case study???

    ReplyDelete
  4. Informasi Khusus Untuk Kamu Pecinta Sabung Ayam Indonesia !

    Agen Bolavita memberikan Bonus sampai dengan Rp 1.000.000,- Khusus Merayakan Natal & Tahun Baru !

    Untuk Informasi Selengkapnya langsung saja hubungi cs kami yang bertugas :
    WA : +62812-2222-995
    BBM : BOLAVITA
    Situs : www.bolavits.site

    Aplikasi Live Chat Playstore / App Store : BOLAVITA Sabung Ayam


    Baca Artikel Sepak bola terbaru di bolavitasport.news !
    Prediksi Angka Togel Online Terupdate di angkamistik.net!

    ReplyDelete
  5. Une entrée très intéressante pleine d’informations inspirantes. Veuillez également visiter le Gowork FR Blog

    ReplyDelete