Pages

Sunday, June 1, 2014

KDD Cup: Profit Optimization in R Part 3: Visualizing Results


Hello Readers,

This post will focus on profit optimization from the 1998 Knowledge Discovery and Data mining Cup in R. Here we will visualize the results obtained from the set of decision trees created in Part 2



cup98 Part 3:


Recall in Part 2 we created 9 decision trees based on 'random' partitions of training and test data from the learning data set. We saved the binary tree objects, and plotted the tree, predicted and actual values, and cumulative donation for each tree. Now that we have saved the predicted donation results in a CSV file, we will go ahead and visualize our results to compare how well the set of trees performed against each other.


Accounting for Cost


Sending out all those mail-in orders soliciting for donations does cost money, in fact, it costs 68 cents a donation mailing. So we have to subtract the cost from the received donations to measure the profit. However we cannot subtract 68 cents from each row, we have to subtract the cumulative cost, because the donation values are cumulative. 

We also create an increment index, "idx.pos", to reduce the number of points plotted by a factor of 10 to make the plots more manageable by R, since there would be over 28,000 points per line. Now it would be 2,864- still plenty of points to capture the deviations in the plots. We will use the "idx.pos" as the index for the x and y variables in the following plots.

Cost and Results Code:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
> # load parameters
> MinSplit <- 1000
> MinBucket <- 400
> MaxSurrogate <- 4
> MaxDepth <- 10 # can change
> (strParameters <- paste(MinSplit, MinBucket, MaxSurrogate,
+ MaxDepth, sep="-"))
[1] "1000-400-4-10"

> # load results with parameters
> results <- read.csv(paste("evaluation-TotalDonation-", strParameters, ".csv", sep=""))
>
> # create index increment by 10 and has last row number
> idx.pos <- c(seq(1, nrow(results), by=10), nrow(results))
>
> # note donation amounts are cumulative
> # modify donation amounts by cost
> cost <- 0.68
> results[,2:11] <- results[,2:11] - cost * (1:dim(results)[1])
> tail(results)
          X   run.1   run.2   run.3   run.4   run.5   run.6   run.7   run.8   run.9  Average
28619 28619 3803.90 4301.16 3075.75 4424.12 3621.54 2238.50 3018.08 3016.33 3756.25 3472.848
28620 28620 3823.22 4300.48 3075.07 4423.44 3620.86 2237.82 3017.40 3015.65 3755.57 3474.390
28621 28621 3822.54 4299.80 3074.39 4422.76 3620.18 2237.14 3016.72 3014.97 3754.89 3473.710
28622 28622 3821.86 4299.12 3073.71 4422.08 3619.50 2236.46 3041.04 3014.29 3754.21 3475.808
28623 28623 3821.18 4298.44 3073.03 4421.40 3618.82 2235.78 3040.36 3013.61 3783.53 3478.461
28624 28624 3820.50 4297.76 3072.35 4420.72 3618.14 2235.10 3039.68 3012.93 3782.85 3477.781

After accounting for cost, we see that the average return for the 9 runs amounted to $3,477.78. Quite a large change from donation revenue to donation profit!


Cumulative Donations


For our first visualization, we create a plot of cumulative test donations. The plot will consist of 10 lines, 9 cumulative donations for each tree and 1 for the average cumulative donations. Plot the average test line first, then add the other donation test lines in a for loop. Also, tack on a legend for clear measure- there are a lot of lines in this plot. Note that you can write the plots to a PDF file if you wish.

Plotting Cumulative Donations Code:
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
> LoopNum <- 9
> # start pdf device ####
> pdf(paste("results-plots-", strParameters, ".pdf", sep=""),
+     width=12, height=9, paper="a4r", pointsize=10)
>
> # plotting cumulative donations ####
> plot(results[idx.pos, 11], type="l", lty=1, lwd=2, col=1, ylim=c(0,5000), 
+      xlab="Number of Mails", ylab="Amount of Donations ($)",
+      main="Cumulative Donations")
> for(fCnt in 1:LoopNum) {
+ lines(results[idx.pos,fCnt+1], pty=".", type="l", lty=1+fCnt,
+       col=1+fCnt)
> }
> legend("bottomright", col=1:c(LoopNum+1), lty=1:(LoopNum+1),
+       legend=c("Average", paste("Run",1:LoopNum)), cex=.68,
+       lwd=c(2,rep(1,LoopNum)))


The weighted black line is the average, and the other 9 lines are in a variety of colors and line types. At first glance we observe run 2 and run 6 as having slow initial increases in total donations. Run 5 performs relatively better than the other runs at the beginning. 


Because the cumulative test donations are indexed by decreasing predicted values, high initial donation values indicate a good model fit for large test donations. Sorted predictions from a good model would order the higher test donations in decreasing order. For poor models, the higher predicted values do not reflect higher test donations. So its initial predicted values do not reflect the largest test donations, and it orders higher test values more evenly, seen in its steady positive slope. Therefore with a slow initial increase, run 2 poorly models the test donation values, as it incorrectly predicts donation values. You can see run 2 increases steadily rather than quickly increase then plateaus like the other runs. It also exceeds the average total donation.


Donation Percentiles


Next we turn to donations viewed as percentiles. To obtain the percentile, we divide the value by the total value and multiply it by 100.
sapply() expedites this process by allowing us to apply a custom percentile function over the 9 run columns, and returns a matrix of donation percentiles for the y values. For the x values we create percentile values based on the number of donations.

So the percentile plot will consist of percentile of the total donation by percentile of mail contacts. You will see the percentile of donations might exceed 100%, because the donations received exceed the cost. There would be some mail-in percentiles below 100%, where the donation percentile is above 100%. So mailing less people might yield more than the final donation amount. Keep this in mind when we choose the best tree, since we want to reduce costs, while maximizing donation profits.

Plotting Donation Percentiles Code:
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
> # plotting donation percentile ####
> donationPercentile <- sapply(2:11, function(i)
+   100 * results[,i] / results[dim(results)[1],i]
+   )
> percentile <- 100 * (1:dim(results)[1]) / dim(results)[1]
>
> plot(percentile[idx.pos], donationPercentile[idx.pos,10],
+      pty=".", type="l", lty=1, lwd=2, col=1, ylim=c(0,150),
+      xlab="Contact Percentile (%)", ylab="Donation Percentile (%)",
+      main="Donation Percentiles")
> grid(col="gray", lty="dotted")
>
> for(fCnt in 1:LoopNum) {
+   lines(percentile[idx.pos], donationPercentile[idx.pos,fCnt],
+         pty=".", type="l", lty=1+fCnt, col=1+fCnt)
+ }
> legend("bottomright", col=1:(LoopNum+1), lty=1:(LoopNum+1),
+        legend=c("Average", paste("Run", 1:LoopNum)),
+        lwd=c(2, rep(1,LoopNum)), cex=.68)




You can see run 6 clearly exceeding 100% donation percentile just beyond the 80% contact percentile (this is just one run though). So by mailing less people, say 40%-60% of the contacts, we might have been able to reach a higher final donation amount, since many runs rose above 100% in that contact percentile range. However, looking back at the absolute donations, we see that run 6 has a lower final total, so the percentages are magnified, and not as optimal as runs with higher final donation outcomes.



Donation Amount per Contact


Another visualization we can use focuses on the donation per contact percentile. Overlaying the percentile donation with the donation per contact percentile allows us to track from where the largest donations are coming. Since we observed run 6 performing well in the previous donation percentage graphic, we will use run 6 here.

First we require the average donations for all the runs and the average donation of the average of runs. To plot two different scale lines in a single graph, we need two titles for the y-axis columns, and the "new" argument to be set to true, par(new=TRUE).

Code:
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
> # plotting double y axis- average donation collected & amount
> avgDonation <- sapply(2:11, function(i)
+                       results[,i] / (1:dim(results)[1])
+                       )
> yTitle <- c("Total Donation Amount Percentile (%)",
+             "Average Donation Amount per Contact ($)")
> par(mar=c(5,4,4,5)+.1)
> plot(percentile[idx.pos], donationPercentile[idx.pos,6],
+      pty=".", type="l", lty="solid", col="red", ylab=yTitle[1],
+      xlab="Contact Percentile (%)")
> grid(col="gray", lty="dotted")
> par(new=TRUE)
> plot(percentile[idx.pos], avgDonation[idx.pos,6], type="l",
+      lty="dashed", col="blue", xaxt="n", yaxt="n", xlab="", ylab="",
+      ylim=c(0,max(avgDonation[,6])))
> axis(4)
> mtext(yTitle[2], side=4, line=2)
> legend("right", col=c("red", "blue"), lty=c("solid", "dashed"),
+        legend=yTitle)
> graphics.off()


Both lines describe run 6. The red line was seen in the previous percentile graph, and the blue line depicts the donation amount per contact percentile. We see a sharp initial increase in the donation percentile to start, then it tapers down towards zero. This is due to ordering of the donation values based on the predicted donation values. The larger donations will come first if the model fits well. 

There are slight positive fluctuations around 10% and 80% contacts, which mirror the increases in donation percentile at those percent contacts. As the contacts increase, the average donation drops due to the large number of contacts.

Now we have a better idea of our results derived from our set of nine decision trees we created in part 2. In the next parts, we examine running more tree sets on different parameters, and selecting the best tree.



Thanks for reading,

Wayne
@beyondvalence
LinkedIn

More:
1. KDD Cup: Profit Optimization in R Part 1: Exploring Data
2. KDD Cup: Profit Optimization in R Part 2: Decision Trees
3. KDD Cup: Profit Optimization in R Part 3: Visualizing Results
4. KDD Cup: Profit Optimization in R Part 4: Selecting Trees
5. KDD Cup: Profit Optimization in R Part 5: Evaluation

1 comment:

  1. Bolavita Agen Sabung Ayam Live Online nomor satu di indonesia !

    di www.bolavita.site Minimal Deposit hanya 50ribu saja untuk kamu yang ingin ikut Pasang Taruhan langsung secara Online !

    Dan Di Bolavita kamu juga bisa menikmati Bonus deposit pertama kamu sebesar 10% !

    Informasi Selengkapnya Hubungi Cs kami yang bertugas :
    WA : +62812-2222-995
    BBM : BOLAVITA

    Aplikasi Live Chat Playstore / App Store : BOLAVITA Sabung Ayam

    ReplyDelete