Pages

Wednesday, June 4, 2014

KDD Cup: Profit Optimization in R Part 4: Selecting Trees


Hello Readers,

This post covers a case study in optimizing donation returns with data from the Knowledge Discovery and Data mining competition. Here in Part 4, we transition to selecting the optimal tree that best predicts the donation profit. In Part 3, we visualized the trees from a set of parameters, and here we compare tree averages among sets with different parameters. This way we can determine which tree parameters we ought to use to generate our binary-tree model.

In Part 2, we created a for loop to construct 9 trees for a set of parameters. Here we require R code for the for loop to be more concise, as a complete loop saves a set of trees in 9 rdata files filling up 200+ MB, and outputs 3 graphs each to a PDF file. To get the results saved to a CSV file quickly, we distill the for loop down to essentials only. Below is a plot for cumulative donations for the 9 trees we created previously. We will create more sets of 9 trees with different parameters, and will plot the average trees together to compare.




for Loop as a Function


You might remember the for loop from Part 2, where we included many outputs to track the progress of the loop. However, tracking the loop process is not our top priority here. Instead we want speed, because we will run a few sets of trees with different parameters, which will take more than a few minutes.

So what do we do? We simply recode the loop into a function, so we can change the parameters then call the designated function. I named it "makeTree", and it creates its own set of "allTotalDonation", "allAvgDonation", and "allDonationPercentile" matrices to store the results. Additionally, we can pass parameter arguments to the function, which is preset to the parameters we ran in the previous posts ("1000-400-4-10"). The "strParameters" will be newly initiated with the the default or any passed arguments.

Automated Decision Trees Code:
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
> # create makeTree function ####
> # run for multiple times and get the average result
> makeTree <- function(LoopNum=9, testSize=28624, nRec=95412, cost=0.68,
+                      MinSplit=1000, MinBucket=400,
+                      MaxSurrogate=4, MaxDepth=10) {
+   
+   # initialize parameters string
+   strParameters <- paste(MinSplit, MinBucket, MaxSurrogate,
+                          MaxDepth, sep="-")
+   
+   # create matrices for results
+   allTotalDonation <- matrix(0, nrow=testSize, ncol=LoopNum)
+   allAvgDonation <- matrix(0, nrow=testSize, ncol=LoopNum)
+   allDonationPercentile <- matrix (0, nrow=testSize, ncol=LoopNum)
+   cat("Begin Tree Building \n\n")
+   cat(paste("Parameters: ", strParameters, "\n\n", sep=""))
+   
+   for (loopCnt in 1:LoopNum) {
+     cat(date(), ": iteration = ", loopCnt, "\n")
+     
+     # split into training data and testing data
+     trainIdx <- sample(1:nRec, trainSize)
+     trainData <- cup98[trainIdx,]
+     testData <- cup98[-trainIdx,]
+     
+     # train a decision tree
+     cup.Ctree <- ctree(TARGET_D ~ ., data=trainData,
+                        controls=ctree_control(minsplit=MinSplit,
+                                               minbucket=MinBucket,
+                                               maxsurrogate=MaxSurrogate,
+                                               maxdepth=MaxDepth))
+     
+     # size of tree
+     print(object.size(cup.Ctree), units="auto")
+     
+     # test
+     pred <- predict(cup.Ctree, newdata=testData, type="response")
+     print(sum(testData$TARGET_D[pred > cost] - cost))
+     
+     # build donation matrices
+     # quicksort used to random tie values
+     s1 <- sort(pred, decreasing=TRUE, method="quick",
+                index.return=TRUE)
+     totalDonation <- cumsum(testData$TARGET_D[s1$ix]) # cumulative sum
+     avgDonation <- totalDonation / (1:testSize)
+     donationPercentile <- 100 * totalDonation / sum(testData$Target_D)
+     allTotalDonation[,loopCnt] <- totalDonation
+     allAvgDonation[,loopCnt] <- avgDonation
+     allDonationPercentile[,loopCnt] <- donationPercentile
+   }
+   cat(date(), ": Loop Completed. \n\n\n")
+   fnlTotalDonation <- rowMeans(allTotalDonation)
+   fnlAveDonation <- rowMeans(allAvgDonation)
+   fnlDonationPercentile <- rowMeans(allDonationPercentile)
+   
+   # writes all loop results into csv
+   rm(trainData, testData, pred)
+   results <- data.frame(cbind(allTotalDonation, fnlTotalDonation))
+   names(results) <- c(paste("run", 1:LoopNum), "Average")
+   write.csv(results, paste("evaluation-TotalDonation-", strParameters,
+                            ".csv", sep=""))
+ }

Note that the function also includes writing the CSV to working directory after the loop completes. The CSV will have the "evaluation-TotalDonation-" name followed by the specific "strParameters" ran by the loop.



Trees with Different Parameters


Now that we have our "makeTree" function, which parameters should we change from the "1000-400-4-10" we used previously? Recall the "strParameters" string consists of "MinSplit", "MinBucket", "MaxSurrogate", and "MaxDepth"

A quick recap of the parameters:
  • MinSplit: the minimum sum of weights in a node to be eligible for splitting
  • MinBucket: the minimum sum of weights in a terminal node
  • MaxSurrogate: number of surrogate splits to evaluate
  • MaxDepth: the maximum depth of the tree
We can change the "MinSplit" parameter from 1000 to 700, or to 200. We can also decrease the "MinBucket" from 400 to 200 and 50. By decreasing the "Split" and "Bucket" values, we allow the tree model to fit the data further, since the sum of weight limits will be lower.  Another option is to change the "MaxDepth" from 10 to 8, 6, or 5. When we decrease the "depth" of the tree, we decrease the size of the tree, so there are less splits. The "MaxSurrogate" parameter we do not alter. 

Let us split the different parameters into two groups, and tackle them separately. First we have the parameters where we change the "MaxDepth" only. Second, we have the parameters with changes to "MinSplit" and "MinBucket". How do we utilize the "makeTree" function here? By employing the for loop. 

The first set of parameters we simply loop through the different "MaxDepth" parameters. The second set of parameters requires something different. We create a matrix of the new parameters and loop through them row by row (or by column depending on your matrix input). This generalized method also allows us to change any parameter any number of times. But since we are only changing two parameters each time, we do not alter the other two parameters, and use the defaults. 

You can create a custom matrix for the different parameters you wish to pass to the "makeTree" function. So the "MaxDepth" parameters could have been included in the parameters matrix allowing you to run just one loop. Below we have the for loop code for the parameters we split into two sets.

Changing Parameters Code:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
> ## looping makeTree fxn ####
> # changing MaxDepth
> parameters1 <- c("1000-400-4-5", "1000-400-4-6", "1000-400-4-8",
+                  "1000-400-4-10")
>
> # loop MaxDepth 5, 6, 8, 10
> for (i in c(5,6,8,10)){ 
+   MaxDepth <- i
+   makeTree(MaxDepth=MaxDepth)
+ }
>
> # changing MinSplit, MinBucket
> parameters2 <- c("1000-400-4-10", "700-200-4-10", "200-50-4-10")
> p2 <- matrix(c(1000, 400, 700, 200, 200, 50), ncol=2, byrow=TRUE)
> p2
     [,1] [,2]
[1,] 1000  400
[2,]  700  200
[3,]  200   50
>
> # loop p2 parameter matrix
> for (row in 1:dim(p2)[1]) {
+   MinSplit <- p2[row,1]
+   MinBucket <- p2[row,2]
+   makeTree(MinSplit=MinSplit,MinBucket=MinBucket)
+ }
>

After running those two loops (remember to load the
party library in R), we now have seven CSV results files. To compare the different tree models, we analyze them graphically.


A Visual Comparison


We plot the results from the first set of parameters first. With the desired data column residing across four files, we use the string elements in the "parameters1" to expedite reading the CSV results files. In each CSV results file, we isolate the "Average" column, and account for the cost of the mail-in solicitation. We plot the first element in "parameters1", and add lines by looping through the rest of the "parameters1". This is the same for the second set of parameters, though it only has three elements in "parameters2". Make sure to add a legend to discern which lines describe which parameters.


Plotting Code:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
> # visualizing the results ####
>
> # parameters1
> # plot first line
> results <- read.csv(paste("evaluation-TotalDonation-",
+ parameters1[1], ".csv", sep=""))
> avgResult <- results$Average - 0.68 * (1:dim(results)[1])
> plot(percentile, avgResult, pty=1, type="l", lty=1, col=1,
+      ylab="Amount of Donation", xlab="Contact Percentile (%)",
+      main="Parameters: MinSplit, MinBucket, MaxSurrogate,
+      MaxDepth", ylim=c(0,4000))
> grid(col="gray", lty="dotted")
> 
> # other lines
> for (i in 2:length(parameters1)) {
+   results <- read.csv(paste("evaluation-TotalDonation-", 
+                             parameters1[i], ".csv", sep=""))
+   avgResult <- results$Average - 0.68 * (1:dim(results)[1])
+   lines(percentile, avgResult, type="l", lty=i, col=i)
+ }
> legend("bottomright", col=1:length(parameters1), lty=1:length(parameters1),
+        legend=parameters1)
> 
>
> # parameters2
> # plot first line
> results <- read.csv(paste("evaluation-TotalDonation-", 
+                           parameters2[1], ".csv", sep=""))
> avgResult <- results$Average - 0.68 * (1:dim(results)[1])
> plot(percentile, avgResult, pty=1, type="l", lty=1, col=1,
+      ylab="Amount of Donation", xlab="Contact Percentile (%)",
+      main="Parameters: MinSplit, MinBucket, MaxSurrogate,
+      MaxDepth", ylim=c(0,4000))
> grid(col="gray", lty="dotted")
> 
> # other lines
> for (i in 2:length(parameters2)) {
+   results <- read.csv(paste("evaluation-TotalDonation-", 
+                             parameters2[i], ".csv", sep=""))
+   avgResult <- results$Average - 0.68 * (1:dim(results)[1])
+   lines(percentile, avgResult, type="l", lty=i, col=i)
+ }
> legend("bottomright", col=1:length(parameters2), lty=1:length(parameters2),
+        legend=parameters2)
>

Comparison of Parameters- 1

Above we have the varied "MaxDepth" plot, and below we have the different "MinSplit" and "MinBucket" plot.


Comparison of Parameters- 2

Now that we have plotted different parameters together, we can compare how well they modeled the KDD Cup data to predict donation profit outcomes.


Tree Selection


We want the tree model which best predicts the donations from our selected variables, so we can estimate how much donation profit we can expect to earn from future mail-in orders. Therefore, we can select for mailing only those fitting specific criteria, optimizing our donation profit.


Let us take a look at the first parameters plot where we varied the "MaxDepth". For the contact percentiles 20% and 40%, we observe the cumulative donations at those positions. The blue "1000-400-4-10" line reaches higher donation levels at similar contact percentile than the other lines of "MaxDepth" values. It appears that a higher "MaxDepth" allows the tree to model the data better than lower "MaxDepth" values. The black line with the lowest "MaxDepth" value of 5 has lower donation values at similar contact percentiles. This is more reason to use a higher "MaxDepth" value.

With the second parameters plot in mind, we can observe the outcome of changing the "MinSplit" and "MinBucket" parameters. As we decreased the value of the two parameters, we see that the cumulative donations as the green and red lines at earlier contact percentiles fall lower than the donations at similar contact percentiles with higher "MinSplit" and "MinBucket" values (1000-400). This is due to the sum of weights being higher so that the tree model does not overfit the data, since we are using the values from untrained test set. Thus the optimal "MinSplit" and "MinBucket" values are high enough such that they do not overfit the data and can be used to extrapolate on untrained test data.


So we will select the binary-tree model with parameters of "1000-400-4-10" for the "MinSplit, MinBucket, MaxSurrogate, MaxDepth" variables respectively. You can run more models with different parameters to get a better idea of how the parameters change the predictions. 



Thanks for reading,

Wayne
@beyondvalence
LinkedIn

More:
1. KDD Cup: Profit Optimization in R Part 1: Exploring Data
2. KDD Cup: Profit Optimization in R Part 2: Decision Trees
3. KDD Cup: Profit Optimization in R Part 3: Visualizing Results
4. KDD Cup: Profit Optimization in R Part 4: Selecting Trees
5. KDD Cup: Profit Optimization in R Part 5: Evaluation

1 comment:

  1. Informasi Khusus Untuk Kamu Pecinta Sabung Ayam Indonesia !

    Agen Bolavita memberikan Bonus sampai dengan Rp 1.000.000,- Khusus Merayakan Natal & Tahun Baru !

    Untuk Informasi Selengkapnya langsung saja hubungi cs kami yang bertugas :
    WA : +62812-2222-995
    BBM : BOLAVITA
    Situs : www.bolavits.site

    Aplikasi Live Chat Playstore / App Store : BOLAVITA Sabung Ayam


    Baca Artikel Sepak bola terbaru di bolavitasport.news !
    Prediksi Angka Togel Online Terupdate di angkamistik.net!

    ReplyDelete