
Sunday, June 8, 2014

KDD Cup: Profit Optimization in R Part 5: Evaluation


Hello Readers,

Today we wrap up this case study on profit optimization using data from the Knowledge Discovery and Data mining competition. In Part 4 we selected the best decision tree parameters for predicting donations. Here in Part 5 we take the best of the 9 trees created with those parameters back in Part 2, and use it on the new data in the evaluation file. Lastly, to verify the predicted donations, we compare them with the evaluation target text file (download here).

Let us start R, and begin data crunching!


Evaluation Data


So far we have been using the data from "cup98LRN.txt" to train various decision tree models. After we selected the best parameters for the optimal tree, we can use that tree object to predict donations (the "TARGET_D" variable) from "cup98VAL.txt". Since it is in CSV format, use
read.csv() to read the file into R. Remember to set the working directory to where the files are located.

The "varSet2" is the same variable set used in training the decision trees in Part 2. We then load the "cup98" data to compare existing variables, and we see that our training data includes "TARGET_D" and the test "cup98VAL" data does not include it. Makes sense, since we aim to predict that variable with our tree model.


Loading Evaluation Data Code:
> library(party)
> # training variable names
> varSet2 <- c("AGE", "AVGGIFT", "CARDGIFT", "CARDPM12",
+              "CARDPROM", "CLUSTER2", "DOMAIN", "GENDER", "GEOCODE2", "HIT",
+              "HOMEOWNR", "HPHONE_D", "INCOME", "LASTGIFT", "MAXRAMNT",
+              "MDMAUD_F", "MDMAUD_R", "MINRAMNT", "NGIFTALL", "NUMPRM12",
+              "PCOWNERS", "PEPSTRFL", "PETS", "RAMNTALL", "RECINHSE",
+              "RFA_2A", "RFA_2F", "STATE", "TIMELAG")
> 
> # load training data
> load("cup98var2.rdata")
> 
> # 5. scoring validation data ####
> # use test dataset cup98VAL.txt
> cup98val <- read.csv("cup98VAL.txt")
> cup98val <- cup98val[, c("CONTROLN", varSet2)]
>
> trainNames <- names(cup98)
> scoreNames <- names(cup98val)
>
> # check if any variables not in scoring data
> idx <- which(trainNames %in% scoreNames)
> print(trainNames[-idx]) # TARGET_D not in scoreNames
[1] "TARGET_D"

But we are not finished checking the evaluation data. We need to verify that the evaluation data has the same factor levels as the training "cup98" data. Since the tree model was trained on the training levels, it cannot evaluate new levels. Take, for example, a fictitious categorical variable named "incomeLevel" with three levels: "low", "middle", and "high". If the "incomeLevel" variable in the test data had an additional level, "upperMiddle", the model could not take the extra level into account. So if there are any extra factor levels in the new data, we must catch them and change them to NA.
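
To see how that re-leveling behaves, here is a tiny hypothetical snippet (the "incomeLevel" variable is made up, not from the KDD data); re-declaring the factor with only the training levels turns the unseen value into NA:

# fictitious training levels
trainLevels <- c("low", "middle", "high")

# scoring data contains an extra level, "upperMiddle"
incomeLevel <- factor(c("low", "upperMiddle", "high"))

# re-declare with the training levels only; "upperMiddle" becomes NA
incomeLevel <- factor(incomeLevel, levels=trainLevels)
incomeLevel
# [1] low  <NA> high
# Levels: low middle high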


Using only the variables in common, we loop through them and check the levels variable by variable. If the variable is a factor (is.factor()) and its levels do not match the training levels (setequal() returns FALSE), we print a warning and reset the factor to the levels in the training data.

Setting Factor Levels Code:
> # check and set levels in factors in scoring data
> scoreData <- cup98val
> vars <- intersect(trainNames, scoreNames)
> for(i in 1:length(vars)) {
+   varname <- vars[i]
+   trainLevels <- levels(cup98[, varname])
+   scoreLevels <- levels(scoreData[, varname])
+   if(is.factor(cup98[,varname]) & 
+        setequal(trainLevels, scoreLevels)==FALSE) 
+     {
+     cat("Warning: new values found in score data, 
+         and they will be changed to NA!\n\n")
+     cat("old ", varname, " levels: ", "\n")
+     cat(scoreLevels, "\n\n")
+     scoreData[, varname] <- factor(scoreData[, varname],
+                                    levels=trainLevels)
+     cat("new ", varname, " levels: ", "\n")
+     cat(levels(scoreData[, varname]),"\n\n")
+   }
+ }
Warning: new values found in score data, 
        and they will be changed to NA!

old  GENDER  levels:  
  F J M U 

new  GENDER  levels:  
  A C F J M U 

Warning: new values found in score data, 
        and they will be changed to NA!

old  STATE  levels:  
AA AE AK AL AP AR AS AZ CA CO CT DC DE FL GA GU HI IA ID IL IN KS KY LA MA MD ME MI MN MO MS MT NC ND NE NH NJ NM NV NY OH OK OR PA PR PW RI SC SD TN TX UT VA VI VT WA WI WV WY 

new  STATE  levels:  
AA AE AK AL AP AR AS AZ CA CO CT DC DE FL GA GU HI IA ID IL IN KS KY LA MA MD ME MI MN MO MS MT NC ND NE NH NJ NM NV NY OH OK OR PA RI SC SD TN TX UT VA VI VT WA WI WV WY 
> 

Inspecting the results, the loop caught two variables in the test data whose factor levels differed from the training data: "GENDER" and "STATE". For "GENDER", the test levels did not include "A" or "C", so those levels were added when we reset the factor to the training levels. Since no "A" or "C" values actually appear in the new data, those levels simply go unused, even though they are now possible. For "STATE", the loop dropped two levels from the test data: "PR" and "PW", the postal abbreviations for Puerto Rico and Palau. The model could not have evaluated them anyway, since "STATE" did not include those two levels when we trained the tree.


Now we are ready to predict the donations: "TARGET_D".


Predicting Donations


When we ran the model loop, we divided the training data into 'training' and 'test' partitions to measure how well each tree fit, by comparing the predicted target variable on the 'test' partition with its actual values. Now that we have selected the best decision tree in Part 4, we load that tree object here to score the true evaluation data, "cup98VAL.txt".

We will use the run number 4 tree from the "1000-400-4-10" parameters. Make sure you have loaded the party library before passing the tree object to the predict() function. Remember that the "scoreData" object holds the test data with the corrected factor levels, and we create a data.frame called "result" containing the prediction and the identification number, "CONTROLN".

Predicting from Evaluation Data Code:

> # after evaluation data is prepared, create predictions
> # load selected tree model
> load("cup98-ctree-1000-400-4-10-run-4.rdata")
> # predicting
> pred <- predict(cup.Ctree, newdata=scoreData)
> pred <- round(pred, digits=3)
> 
> table(pred, useNA="ifany")
pred
0.545 0.585 0.591 0.639 0.667 0.741 0.869 0.963 0.984 1.062 1.182 1.307 
11051 32563  2539  7228  7735 10729  6349  4251  2536  1656  1695  1448 
1.453 1.573 1.676 1.795 1.821 2.051 3.349 
  867  1253   555   651   585  1996   680 
>
> # barplot of prediction
> barplot(table(pred), las=2, ylim=c(0,35000),
+         main="Number of Amount of Donations",
+         cex.names=0.8, cex.axis=0.8,
+         xlab="Donation Amount",
+         ylab="Number of Donations")
> grid(nx=NA,ny=14)
>
> result <- data.frame(scoreData$CONTROLN, pred)
> names(result) <- c("CONTROLN", "pred")
> str(result)
'data.frame': 96367 obs. of  2 variables:
 $ CONTROLN: int  188946 126296 155244 123985 119118 10120 59465 80803 2824 145014 ...
 $ pred    : num  1.062 0.585 0.585 0.545 0.667 ...
> 

After we predict the target values, and round the results, we take a look at them in a table. But better yet, look at them in a barplot:




The barplot tells us a few things about the distribution of the predicted values. First, the predicted donations ranged from $0.545 to $3.349, with peaks at 0.585 and 0.741. Recall that the distribution of donations in the training data was also skewed to the right, but ranged from $1 to $200 (below), and the majority of those donations were below $27.




So the model did not predict the donations perfectly; we did not see predicted donations anywhere near the high values in the training data, which were outliers. It is possible that no donations in the evaluation data exceeded $50. However, we saw that training donations tended to occur in multiples of five, so predicted values between $1 and $5 seem unlikely to match actual donation amounts.


Nevertheless, we still have our predicted values, and can calculate the donation profit by subtracting the cost of 68 cents.



<black sheep wall>


Luckily for us, we have access to the true donation values in the "valtargt.txt" file from the DATA section of the 1998 KDD Cup. It has the "TARGET_B" and "TARGET_D" variables identified by "CONTROLN". So using the code below, we load the CSV file and
merge() the true values with the predicted results. Again, make sure the file is in your working directory.

Why black sheep wall? Because the "valtargt.txt" enables us to see the true donation values for the validation file, and the black sheep wall cheat code disabled the fog of war in Starcraft.

True Donations Code:
> # load evaluation targets
> valTarget <- read.csv("valtargt.txt")
> # nrows=96,367
>
> # create merged data.frame with actual and results ####
> merged <- merge(result, valTarget, by="CONTROLN")
> merged <- merged[order(merged$pred, decreasing=TRUE),]

> # plot regression pred and target
> library(ggplot2)
> ggplot(merged, aes(pred, TARGET_D)) +
+   geom_point(colour="red", alpha=0.05)
>

We order the "merged" data.frame by decreasing prediction values, similar to what we did when creating the decision trees in Part 2. The actual donation values are thus ordered by the predicted values, so we can observe how close the predictions are to the actual donations. Using ggplot(), we plot the predictions "pred" against the actual donations "TARGET_D".


We notice the majority of donations fall below $50, while the majority of predicted donations fall below $2. In the test data, there was a high donation at $500, compared to the high of $200 in the training data. Now that we have seen the compactness of our predictions, we will calculate the donation profit based on the true donations, using sum().

Calculating Donation Profit Code:
> # donation profit if mail all people ####
> # nrows=96,367
> cost <- 0.68
> sum(valTarget$TARGET_D - cost)
[1] 10560.08
> 
> # donation profit if mail those predicted ####
> # to donate more than mail cost, nrows=35,250
> idx <- (merged$pred > cost)
> sum(merged$TARGET_D[idx] - cost)
[1] 11857.77
> 

The donation profit sums to $10,560.08. How about mailing only those predicted to donate more than the $0.68 cost? Creating an index of the predictions greater than the cost, we subset the donation values and arrive at $11,857.77. By mailing selectively, we increase the profit by $1,297.69.
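
As a quick arithmetic check on that increase:

# selective mailing profit minus mail-everyone profit
11857.77 - 10560.08   # = 1297.69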



Visualizing Donations


First we plot the cumulative donations if we mailed the donation solicitations to all the customers. We create a percentage scale for the x-axis and use cumsum() to calculate the cumulative sum for the y-axis, while factoring in the cost. Again, as in Part 3, we build an index in increments of 10 so R plots more quickly.

We index by 10 for the second plot as well, where we plot the cumulative donations against the percentage of optimal customers. Note that we ranked the customers by their predicted donation value while accounting for the cost, and that our predictions were not the best at predicting donations (see the regression plot above). But is it better than mailing to everyone? For this model, yes, by $1,297.69, if we mail to 35,250 of the 96,367 total customers.

Creating Donation Visualizations Code:
> # plot ranking all customers ####
> x <- 100 * (1:nrow(merged)) / nrow(merged)
> y <- cumsum(merged$TARGET_D) - cost*(1:nrow(valTarget))
> # to reduce size of the file to save this chart
> # n=9,638
> idx.pos <- c(seq(1, length(x), by=10), length(x))
> plot(x[idx.pos], y[idx.pos], type="l", xlab="Contact Percentile (%)",
+      ylab="Amount of Donation", main="All Customers")
> grid()
> 
> # plot ranking only optimal customers ####
> x1 <- 100 * (1:nrow(merged[idx,])) / nrow(merged[idx,])
> y1 <- cumsum(merged$TARGET_D[idx]) - cost*(1:nrow(valTarget[idx,]))
> idx.pos1 <- c(seq(1, length(x1), by=10), length(x1))
> plot(x1[idx.pos1], y1[idx.pos1], type="l", xlab="Contact Percentile (%)",
+      ylab="Amount of Donation", main="Optimal Customers")
> grid()
> 
> # plot both customer group donations ####
> x3 <- (1:nrow(merged)/1000)
> plot(x3[idx.pos], y[idx.pos], type="l", col=1, ylim=c(0,13000),
+      xlab="Number Contacted, Thousands", ylab="Amount of Donation $", 
+      main="Cumulative Donations for Both Customer Groups")
> grid()
> par(new=TRUE)
> plot(x3[idx.pos1], y1[idx.pos1], type="l", col=2, 
+      xaxt="n", yaxt="n", xlab="", ylab="",
+      xlim=c(0,100), ylim=c(0,13000))
> legend("bottomright", col=c(1,2), lty=c(1,1),
+        legend=c("All Customers (n=96,367)", 
+                 "Profitable Customers Only (n=35,250)"))
> 

As a visual bonus, I include a third plot that overlays both curves to showcase the cumulative donations accrued as different numbers of customers are mailed. Below is the first plot, where we mailed to all the customers. Recall the donation profit for all customers is just above $10.5k. We see large donation increases in the first 20% of the customer mailings; these increases indicate that the predictions captured large true donation values, since the customers are ordered by decreasing prediction. The peak approaches $13k, but because we mailed to all customers, we incurred deficits whenever customers did not donate; see the drop in donations between the 60% and 80% contact percentiles.




So, taking into account that not all customers donated, or donated more than the $0.68 cost, we proactively selected those predicted to donate above the 68 cent cost, shown below. Immediately we see differences in the cumulative curve compared to the plot above. Here we see a general increase in cumulative donation profit, indicating that the predicted values captured true donations above the cost. The drops and plateaus show that the predictions also captured customers who did not donate sufficient amounts, and some customers who did donate sufficient amounts were likely missed by the selection and so do not appear in this plot. Still, notice that this final donation total is higher than the final total in the all-customers plot.




Now we turn to the plot below, where both cumulative donation lines coexist. Instead of customer percentile on the x-axis, we use the number of customers in thousands, so we can compare how many customers were mailed. With the all-customers curve a familiar sight, note the red line, which restricts mailing to customers predicted to donate more than the cost. The red line mainly follows the all-customers line, but begins to overtake it in cumulative donations around 10 thousand customers, and ends just under $12k at roughly 35 thousand customers. The black line includes all customers, regardless of how much they donated, so it extends farther than the selective red line. Here you can compare the y-axis end points for both customer groups, with the red line ending higher than the black line.





Notes


So how do we improve the model? There were two target variables: "TARGET_B", which indicated yes or no to donating, and "TARGET_D", which gave the donation amount. A next step would be to build a two-tier model that first predicts who will donate ("TARGET_B"), and then, for those predicted to donate, how much ("TARGET_D"). With our model yielding a donation profit of $11,857.77, what is the maximum donation profit attainable on the evaluation data? How well did we predict the donations?
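
As a rough illustration of that two-tier idea, here is a hedged sketch that was not part of the original analysis. It assumes "cup98LRN.txt" is re-read into a data frame (called "cup98lrn" here, a hypothetical name) keeping both "TARGET_B" and "TARGET_D" alongside "varSet2", and that "scoreData" has its factor levels aligned to the training levels as we did earlier; the object names and control parameters are illustrative, not tuned.

Hypothetical Two-Tier Sketch:
# hypothetical two-tier model sketch (names and parameters are illustrative)
library(party)

# tier 1: classify who donates at all (TARGET_B as a 0/1 factor)
classData <- cup98lrn[, c("TARGET_B", varSet2)]
classData$TARGET_B <- factor(classData$TARGET_B)
donate.Ctree <- ctree(TARGET_B ~ ., data=classData,
                      controls=ctree_control(minsplit=1000, minbucket=400,
                                             maxsurrogate=4, maxdepth=10))

# tier 2: model the donation amount on actual donors only
donors <- cup98lrn$TARGET_B == 1
amountData <- cup98lrn[donors, c("TARGET_D", varSet2)]
amount.Ctree <- ctree(TARGET_D ~ ., data=amountData,
                      controls=ctree_control(minsplit=200, minbucket=50,
                                             maxsurrogate=4, maxdepth=8))

# expected donation = P(donate) * predicted amount among donors
pDonate <- sapply(predict(donate.Ctree, newdata=scoreData, type="prob"),
                  function(p) p[2])
amount <- as.numeric(predict(amount.Ctree, newdata=scoreData))
expectedDonation <- pDonate * amount

# mail only customers whose expected donation exceeds the 68 cent cost
mailIdx <- expectedDonation > 0.68

You could then score this mailing list against "valtargt.txt" exactly as we did for the single tree. Returning to the single-tree results, the code below finds the maximum donation profit attainable with perfect hindsight.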

Max Donation Profit Code:

> # optimal donation profit ####
> # those who donated more than cost, n=4,872
> merged <- merged[order(merged$TARGET_D, decreasing=TRUE),]
> idx <- (merged$TARGET_D > cost)
> sum(idx)
[1] 4872
> sum(merged$TARGET_D[idx] - cost)
[1] 72776.36
>
> # plotting cumulative donations
> plot(1:sum(idx), cumsum(merged$TARGET_D[idx]-cost),
+ type="l", ylim=c(0,80000),
+ ylab="Cumulative Donations ($)",
+ xlab="Customers Mailed",
+ main="Optimal Customer Selection",
+ sub="$72,776.36 Max Donation Profit")
> grid()

Out of the 96,367 customers in the evaluation data, selecting only those who donated more than 68 cents, we would mail donation requests to just 4,872 customers for a profit of $72,776.36. Wow, such a large donation profit! Quickly we realize that our donation profit was so low because we failed to eliminate those who would donate insufficient amounts, mailing to over 35,000 customers instead of the optimal 4,872. Looking at the real values, our optimal model was not very optimal at all!




With perfect customer selection, donation ordering, and immaculate hindsight, the cumulative donation plot above shows no drops, and therefore no deficits, in donations.


Well folks, thank you for following along this case study series on optimizing profits with decision trees using previous KDD data. Here is a link to the 2014 KDD conference in New York City. Stay tuned for more R posts, and case studies! Let me know what you would like to see on the Valence Analytics blog.


Thanks for reading,

Wayne
@beyondvalence
LinkedIn

More:
1. KDD Cup: Profit Optimization in R Part 1: Exploring Data
2. KDD Cup: Profit Optimization in R Part 2: Decision Trees
3. KDD Cup: Profit Optimization in R Part 3: Visualizing Results
4. KDD Cup: Profit Optimization in R Part 4: Selecting Trees
5. KDD Cup: Profit Optimization in R Part 5: Evaluation

Wednesday, June 4, 2014

KDD Cup: Profit Optimization in R Part 4: Selecting Trees


Hello Readers,

This post covers a case study in optimizing donation returns with data from the Knowledge Discovery and Data mining competition. Here in Part 4, we transition to selecting the tree parameters that best predict the donation profit. In Part 3, we visualized the trees built from one set of parameters; here we compare tree averages among sets with different parameters. This way we can determine which tree parameters we ought to use to generate our binary-tree model.

In Part 2, we created a for loop to construct 9 trees for one set of parameters. Here we need the for loop code to be more concise, since a complete loop saves its set of trees in 9 rdata files taking up 200+ MB, and outputs 3 graphs each to a PDF file. To get the results saved to a CSV file quickly, we distill the for loop down to the essentials. Below is a plot of the cumulative donations for the 9 trees we created previously. We will create more sets of 9 trees with different parameters, and plot the averages of the sets together to compare them.




for Loop as a Function


You might remember the for loop from Part 2, where we included many outputs to track the progress of the loop. However, tracking the loop process is not our top priority here. Instead we want speed, because we will run a few sets of trees with different parameters, which will take more than a few minutes.

So what do we do? We simply recode the loop into a function, so we can change the parameters and then call the function. I named it "makeTree", and it creates its own "allTotalDonation", "allAvgDonation", and "allDonationPercentile" matrices to store the results. Additionally, we can pass parameter arguments to the function, which defaults to the parameters we ran in the previous posts ("1000-400-4-10"). The "strParameters" string is rebuilt each call from the default or passed arguments.

Automated Decision Trees Code:
> # create makeTree function ####
> # run for multiple times and get the average result
> makeTree <- function(LoopNum=9, testSize=28624, nRec=95412, cost=0.68,
+                      MinSplit=1000, MinBucket=400,
+                      MaxSurrogate=4, MaxDepth=10) {
+   
+   # initialize parameters string
+   strParameters <- paste(MinSplit, MinBucket, MaxSurrogate,
+                          MaxDepth, sep="-")
+   
+   # create matrices for results
+   allTotalDonation <- matrix(0, nrow=testSize, ncol=LoopNum)
+   allAvgDonation <- matrix(0, nrow=testSize, ncol=LoopNum)
+   allDonationPercentile <- matrix (0, nrow=testSize, ncol=LoopNum)
+   cat("Begin Tree Building \n\n")
+   cat(paste("Parameters: ", strParameters, "\n\n", sep=""))
+   
+   for (loopCnt in 1:LoopNum) {
+     cat(date(), ": iteration = ", loopCnt, "\n")
+     
+     # split into training data and testing data
+     trainSize <- nRec - testSize # training size, so the function is self-contained
+     trainIdx <- sample(1:nRec, trainSize)
+     trainData <- cup98[trainIdx,]
+     testData <- cup98[-trainIdx,]
+     
+     # train a decision tree
+     cup.Ctree <- ctree(TARGET_D ~ ., data=trainData,
+                        controls=ctree_control(minsplit=MinSplit,
+                                               minbucket=MinBucket,
+                                               maxsurrogate=MaxSurrogate,
+                                               maxdepth=MaxDepth))
+     
+     # size of tree
+     print(object.size(cup.Ctree), units="auto")
+     
+     # test
+     pred <- predict(cup.Ctree, newdata=testData, type="response")
+     print(sum(testData$TARGET_D[pred > cost] - cost))
+     
+     # build donation matrices
+     # quicksort used to random tie values
+     s1 <- sort(pred, decreasing=TRUE, method="quick",
+                index.return=TRUE)
+     totalDonation <- cumsum(testData$TARGET_D[s1$ix]) # cumulative sum
+     avgDonation <- totalDonation / (1:testSize)
+     donationPercentile <- 100 * totalDonation / sum(testData$TARGET_D)
+     allTotalDonation[,loopCnt] <- totalDonation
+     allAvgDonation[,loopCnt] <- avgDonation
+     allDonationPercentile[,loopCnt] <- donationPercentile
+   }
+   cat(date(), ": Loop Completed. \n\n\n")
+   fnlTotalDonation <- rowMeans(allTotalDonation)
+   fnlAveDonation <- rowMeans(allAvgDonation)
+   fnlDonationPercentile <- rowMeans(allDonationPercentile)
+   
+   # writes all loop results into csv
+   rm(trainData, testData, pred)
+   results <- data.frame(cbind(allTotalDonation, fnlTotalDonation))
+   names(results) <- c(paste("run", 1:LoopNum), "Average")
+   write.csv(results, paste("evaluation-TotalDonation-", strParameters,
+                            ".csv", sep=""))
+ }

Note that the function also includes writing the CSV to working directory after the loop completes. The CSV will have the "evaluation-TotalDonation-" name followed by the specific "strParameters" ran by the loop.



Trees with Different Parameters


Now that we have our "makeTree" function, which parameters should we change from the "1000-400-4-10" set we used previously? Recall that the "strParameters" string consists of "MinSplit", "MinBucket", "MaxSurrogate", and "MaxDepth".

A quick recap of the parameters:
  • MinSplit: the minimum sum of weights in a node to be eligible for splitting
  • MinBucket: the minimum sum of weights in a terminal node
  • MaxSurrogate: number of surrogate splits to evaluate
  • MaxDepth: the maximum depth of the tree
We can change the "MinSplit" parameter from 1000 to 700, or to 200. We can also decrease "MinBucket" from 400 to 200 or 50. By decreasing the split and bucket values, we allow the tree model to fit the data more closely, since the sum-of-weights thresholds will be lower. Another option is to change "MaxDepth" from 10 to 8, 6, or 5. When we decrease the depth of the tree, we decrease the size of the tree, so there are fewer splits. We do not alter the "MaxSurrogate" parameter.

Let us split the different parameters into two groups and tackle them separately. First we have the parameters where we change "MaxDepth" only. Second, we have the parameters with changes to "MinSplit" and "MinBucket". How do we utilize the "makeTree" function here? By employing a for loop.

For the first set of parameters, we simply loop through the different "MaxDepth" values. The second set requires something different: we create a matrix of the new parameters and loop through it row by row (or by column, depending on your matrix layout). This generalized method also lets us change any parameter any number of times. Since we only change two parameters at a time, we leave the other two at their defaults.

You can create a custom matrix for whatever parameters you wish to pass to the "makeTree" function, so the "MaxDepth" values could have been included in the parameter matrix, allowing you to run just one loop (a generalized sketch appears after the code below). Below we have the for loop code for the two sets of parameters.

Changing Parameters Code:

> ## looping makeTree fxn ####
> # changing MaxDepth
> parameters1 <- c("1000-400-4-5", "1000-400-4-6", "1000-400-4-8",
+                  "1000-400-4-10")
>
> # loop MaxDepth 5, 6, 8, 10
> for (i in c(5,6,8,10)){ 
+   MaxDepth <- i
+   makeTree(MaxDepth=MaxDepth)
+ }
>
> # changing MinSplit, MinBucket
> parameters2 <- c("1000-400-4-10", "700-200-4-10", "200-50-4-10")
> p2 <- matrix(c(1000, 400, 700, 200, 200, 50), ncol=2, byrow=TRUE)
> p2
     [,1] [,2]
[1,] 1000  400
[2,]  700  200
[3,]  200   50
>
> # loop p2 parameter matrix
> for (row in 1:dim(p2)[1]) {
+   MinSplit <- p2[row,1]
+   MinBucket <- p2[row,2]
+   makeTree(MinSplit=MinSplit,MinBucket=MinBucket)
+ }
>

After running those two loops (remember to load the
party library in R), we now have seven CSV results files. To compare the different tree models, we analyze them graphically.
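
As an aside, all four parameters could live in one matrix, so a single loop covers every combination. Below is a minimal sketch, assuming the "makeTree" function defined above; the "params" matrix is only an illustration.

Hypothetical Combined Parameters Sketch:
# hypothetical combined parameter matrix: MinSplit, MinBucket, MaxSurrogate, MaxDepth
params <- matrix(c(1000, 400, 4, 10,
                   1000, 400, 4,  8,
                    700, 200, 4, 10,
                    200,  50, 4, 10),
                 ncol=4, byrow=TRUE)
colnames(params) <- c("MinSplit", "MinBucket", "MaxSurrogate", "MaxDepth")

# one makeTree call per row
for (row in 1:nrow(params)) {
  makeTree(MinSplit=params[row, "MinSplit"],
           MinBucket=params[row, "MinBucket"],
           MaxSurrogate=params[row, "MaxSurrogate"],
           MaxDepth=params[row, "MaxDepth"])
}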


A Visual Comparison


We plot the results from the first set of parameters first. Since the desired data column resides across four files, we use the strings in "parameters1" to expedite reading the CSV results files. In each CSV results file, we isolate the "Average" column and account for the cost of the mail-in solicitation. We plot the first element in "parameters1", then add lines by looping through the rest of "parameters1". The second set of parameters works the same way, though "parameters2" has only three elements. Make sure to add a legend to discern which lines describe which parameters.


Plotting Code:

> # visualizing the results ####
>
> # parameters1
> # plot first line
> results <- read.csv(paste("evaluation-TotalDonation-",
+ parameters1[1], ".csv", sep=""))
> # contact percentile scale from Part 3 (one value per test-set row)
> percentile <- 100 * (1:dim(results)[1]) / dim(results)[1]
> avgResult <- results$Average - 0.68 * (1:dim(results)[1])
> plot(percentile, avgResult, pty=1, type="l", lty=1, col=1,
+      ylab="Amount of Donation", xlab="Contact Percentile (%)",
+      main="Parameters: MinSplit, MinBucket, MaxSurrogate,
+      MaxDepth", ylim=c(0,4000))
> grid(col="gray", lty="dotted")
> 
> # other lines
> for (i in 2:length(parameters1)) {
+   results <- read.csv(paste("evaluation-TotalDonation-", 
+                             parameters1[i], ".csv", sep=""))
+   avgResult <- results$Average - 0.68 * (1:dim(results)[1])
+   lines(percentile, avgResult, type="l", lty=i, col=i)
+ }
> legend("bottomright", col=1:length(parameters1), lty=1:length(parameters1),
+        legend=parameters1)
> 
>
> # parameters2
> # plot first line
> results <- read.csv(paste("evaluation-TotalDonation-", 
+                           parameters2[1], ".csv", sep=""))
> avgResult <- results$Average - 0.68 * (1:dim(results)[1])
> plot(percentile, avgResult, pty=1, type="l", lty=1, col=1,
+      ylab="Amount of Donation", xlab="Contact Percentile (%)",
+      main="Parameters: MinSplit, MinBucket, MaxSurrogate,
+      MaxDepth", ylim=c(0,4000))
> grid(col="gray", lty="dotted")
> 
> # other lines
> for (i in 2:length(parameters2)) {
+   results <- read.csv(paste("evaluation-TotalDonation-", 
+                             parameters2[i], ".csv", sep=""))
+   avgResult <- results$Average - 0.68 * (1:dim(results)[1])
+   lines(percentile, avgResult, type="l", lty=i, col=i)
+ }
> legend("bottomright", col=1:length(parameters2), lty=1:length(parameters2),
+        legend=parameters2)
>

Comparison of Parameters- 1

Above we have the varied "MaxDepth" plot, and below we have the different "MinSplit" and "MinBucket" plot.


Comparison of Parameters- 2

Now that we have plotted different parameters together, we can compare how well they modeled the KDD Cup data to predict donation profit outcomes.


Tree Selection


We want the tree model which best predicts the donations from our selected variables, so we can estimate how much donation profit we can expect to earn from future mail-in orders. Therefore, we can select for mailing only those fitting specific criteria, optimizing our donation profit.


Let us take a look at the first parameters plot, where we varied "MaxDepth". At contact percentiles of 20% and 40%, we observe the cumulative donations at those positions. The blue "1000-400-4-10" line reaches higher donation levels at similar contact percentiles than the lines with lower "MaxDepth" values. It appears that a higher "MaxDepth" allows the tree to model the data better than lower "MaxDepth" values do. The black line, with the lowest "MaxDepth" value of 5, has lower donation values at similar contact percentiles, which is more reason to use a higher "MaxDepth" value.

Looking at the second parameters plot, we can observe the outcome of changing the "MinSplit" and "MinBucket" parameters. As we decreased these two values, the cumulative donations for the green and red lines at earlier contact percentiles fall below those for the higher "MinSplit" and "MinBucket" values (1000-400). With the higher sum-of-weights thresholds, the tree model does not overfit the training data, which matters because we are scoring the untrained test set. Thus the optimal "MinSplit" and "MinBucket" values are high enough that the tree does not overfit and can extrapolate to untrained test data.


So we will select the binary-tree model with parameters of "1000-400-4-10" for "MinSplit", "MinBucket", "MaxSurrogate", and "MaxDepth", respectively. You can run more models with different parameters to get a better idea of how the parameters change the predictions.



Thanks for reading,

Wayne
@beyondvalence
LinkedIn

More:
1. KDD Cup: Profit Optimization in R Part 1: Exploring Data
2. KDD Cup: Profit Optimization in R Part 2: Decision Trees
3. KDD Cup: Profit Optimization in R Part 3: Visualizing Results
4. KDD Cup: Profit Optimization in R Part 4: Selecting Trees
5. KDD Cup: Profit Optimization in R Part 5: Evaluation