Wednesday, May 28, 2014

KDD Cup: Profit Optimization in R Part 2: Decision Trees

This post continues the Knowledge Discovery and Data Mining (KDD) Cup case study from Part 1, where we explored the distribution and relationships of the target and predictor variables. Here in Part 2, we will build decision trees to predict the target donation variable from the predictor variables using the party R library.

Recall that we are using decision trees to maximize returns (donations) from mail-in orders, predicting from many variables, including demographics, previous giving history, promotion history, and recency-frequency-donation variables. The data were used for the KDD 1998 Cup Competition.

cup98 Data

From Part 1 we created an abbreviated dataset consisting of 67 of the original 481 variables. Here we will subset the 67 variables further down to 30 variables, including the target variable, which is repositioned as the first variable.

Training Set Code:
```
## 2. Training Decision Trees ####
> library(party) # recursive 'PARTY'tioning
# create new set
> varSet2 <- c("AGE", "AVGGIFT", "CARDGIFT", "CARDPM12",
+   "CARDPROM", "CLUSTER2", "DOMAIN", "GENDER", "GEOCODE2", "HIT",
+   "HOMEOWNR", "HPHONE_D", "INCOME", "LASTGIFT", "MAXRAMNT",
+   "MDMAUD_F", "MDMAUD_R", "MINRAMNT", "NGIFTALL", "NUMPRM12",
+   "PCOWNERS", "PEPSTRFL", "PETS", "RAMNTALL", "RECINHSE",
+   "RFA_2A", "RFA_2F", "STATE", "TIMELAG")
> cup98 <- cup98[, c("TARGET_D", varSet2)]
> str(cup98)
'data.frame': 95412 obs. of 30 variables:
 $ TARGET_D: num 0 0 0 0 0 0 0 0 0 0 ...
 $ AGE     : int 60 46 NA 70 78 NA 38 NA NA 65 ...
 $ AVGGIFT : num 7.74 15.67 7.48 6.81 6.86 ...
 $ CARDGIFT: int 14 1 14 7 8 3 8 4 8 1 ...
 $ CARDPM12: int 6 6 6 6 10 6 4 6 6 4 ...
 $ CARDPROM: int 27 12 26 27 43 15 26 14 29 11 ...
 $ CLUSTER2: int 39 1 60 41 26 16 53 38 57 34 ...
 $ DOMAIN  : Factor w/ 17 levels " ","C1","C2",..: 12 8 6 6 9 12 12 12 6 11 ...
 $ GENDER  : Factor w/ 7 levels " ","A","C","F",..: 4 6 6 4 4 1 4 4 6 6 ...
 $ GEOCODE2: Factor w/ 6 levels ""," ","A","B",..: 5 3 5 5 3 5 6 5 6 4 ...
 $ HIT     : int 0 16 2 2 60 0 0 1 0 0 ...
 $ HOMEOWNR: Factor w/ 3 levels " ","H","U": 1 2 3 3 2 1 2 3 3 1 ...
 $ HPHONE_D: int 0 0 1 1 1 0 1 1 1 0 ...
 $ INCOME  : int NA 6 3 1 3 NA 4 2 3 NA ...
 $ LASTGIFT: num 10 25 5 10 15 15 11 11 22 15 ...
 $ MAXRAMNT: num 12 25 16 11 15 16 12 11 22 15 ...
 $ MDMAUD_F: Factor w/ 4 levels "1","2","5","X": 4 4 4 4 4 4 4 4 4 4 ...
 $ MDMAUD_R: Factor w/ 5 levels "C","D","I","L",..: 5 5 5 5 5 5 5 5 5 5 ...
 $ MINRAMNT: num 5 10 2 2 3 10 3 5 10 3 ...
 $ NGIFTALL: int 31 3 27 16 37 4 14 5 11 3 ...
 $ NUMPRM12: int 14 13 14 14 25 12 9 12 12 9 ...
 $ PCOWNERS: Factor w/ 2 levels " ","Y": 1 1 1 1 1 1 2 1 1 1 ...
 $ PEPSTRFL: Factor w/ 2 levels " ","X": 2 1 2 2 1 2 2 1 2 1 ...
 $ PETS    : Factor w/ 2 levels " ","Y": 1 1 1 1 1 1 2 1 1 1 ...
 $ RAMNTALL: num 240 47 202 109 254 51 107 31 199 28 ...
 $ RECINHSE: Factor w/ 2 levels " ","X": 1 1 1 1 2 1 1 1 1 1 ...
 $ RFA_2A  : Factor w/ 4 levels "D","E","F","G": 2 4 2 2 3 3 2 2 3 3 ...
 $ RFA_2F  : int 4 2 4 4 2 1 1 3 1 1 ...
 $ STATE   : Factor w/ 57 levels "AA","AE","AK",..: 20 9 33 9 14 4 21 24 18 48 ...
 $ TIMELAG : int 4 18 12 9 14 6 4 6 8 7 ...
```

Setting Parameters

Before we train the decision trees, we need to set the parameters of the trees. The party library allows us to create trees with recursive binary 'party'tioning. First we determine the test (0.3) and training set (0.7) sizes to be created from the learning data. Then we set the "MinSplit" variable to 1000, the "MinBucket" to 400, the "MaxSurrogate" to 4, and the "MaxDepth" to 10. "MinSplit" is the minimum sum of weights in a node to be eligible for splitting; "MinBucket" is the minimum sum of weights in a terminal node; "MaxSurrogate" is the number of surrogate splits to evaluate; "MaxDepth" is the maximum depth of the tree. Surrogate splits are evaluated from other predictors after the best predictor is determined for splitting, and they are stored with each primary split.

Parameter Code:
```
# set parameters
> nRec <- dim(cup98)[1]
> trainSize <- round(nRec*0.7)
> testSize <- nRec - trainSize
## ctree parameters
> MinSplit <- 1000
> MinBucket <- 400
> MaxSurrogate <- 4
> MaxDepth <- 10 # can change
> (strParameters <- paste(MinSplit, MinBucket, MaxSurrogate,
+   MaxDepth, sep="-"))
[1] "1000-400-4-10"
# number of loops
> LoopNum <- 9
## cost for each contact is $0.68
> cost <- 0.68
```

Observe that the string "strParameters" captures the decision tree parameters. We also store the number of decision trees to generate in "LoopNum" as 9, and the cost of each contact as $0.68 in "cost".

Looping Decision Trees

Why do we want to create a "strParameters" variable? The reason will become evident soon: it lets us rerun the decision trees under different parameter sets and compare the predicted donation values.
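Since each run's output files are tagged with this string, several candidate parameter sets can be encoded the same way. A small sketch in base R (the second parameter set below is hypothetical, purely for illustration):

```
# encode each candidate set as "MinSplit-MinBucket-MaxSurrogate-MaxDepth"
paramSets <- list(c(1000, 400, 4, 10),  # the set used in this post
                  c(700, 200, 4, 8))    # hypothetical alternative set
labels <- sapply(paramSets, paste, collapse="-")
labels
# [1] "1000-400-4-10" "700-200-4-8"
# output file names then embed the settings that produced them
paste("evaluation-tree-", labels, ".pdf", sep="")
```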

Because we are creating multiple decision trees (LoopNum=9), I advocate using a for loop to iterate through the trees, saving and writing the data for each. Each iteration also produces the decision tree plot and a plot of the cumulative donation amount, ordered by decreasing predicted donation.

The output will be written to a pdf file, using the pdf() function to start the R graphics device, and delineated by printing "strParameters" and the loop number to track the loop output. By the end of the loop we will have results from 9 trees, plus a 10th column holding their average. We average the 9 runs in an attempt to smooth out partitioning error: a single train/test split can distort the results, whereas 9 runs incorporate different partitioning patterns.
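The averaging itself is just rowMeans() over a matrix whose columns are the individual runs. A toy base-R sketch (the numbers are illustrative, not from the dataset):

```
# columns = cumulative donations from three toy runs;
# rowMeans() averages across runs at every rank,
# smoothing the variation from any single train/test split
runs <- cbind(c(0, 10, 25), c(5, 15, 20), c(1, 14, 27))
rowMeans(runs)
# [1]  2 13 24
```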

So the first step is to open the pdf() graphics device and set the output file name, which includes "strParameters". Then in the output we print the parameters, and create the three result matrices for total donation, average donation, and donation percentile.

For Loop Decision Tree Code:

```
## run 9 times for tree building and use avg result
> pdf(paste("evaluation-tree-", strParameters, ".pdf", sep=""),
+   width=12, height=9, paper="a4r", pointsize=6)
> cat(date(), "\n")
> cat(" trainSize=", trainSize, ", testSize=", testSize, "\n")
> cat(" MinSplit=", MinSplit, ", MinBucket=", MinBucket,
+   ", MaxSurrogate=", MaxSurrogate, ", MaxDepth=", MaxDepth,
+   "\n\n")
# run for multiple times and get the average result
> allTotalDonation <- matrix(0, nrow=testSize, ncol=LoopNum)
> allAvgDonation <- matrix(0, nrow=testSize, ncol=LoopNum)
> allDonationPercentile <- matrix(0, nrow=testSize, ncol=LoopNum)
> for (loopCnt in 1:LoopNum) {
+   cat(date(), ": iteration = ", loopCnt, "\n")
+   # split into training data and testing data
+   trainIdx <- sample(1:nRec, trainSize)
+   trainData <- cup98[trainIdx,]
+   testData <- cup98[-trainIdx,]
+   # train a decision tree
+   cup.Ctree <- ctree(TARGET_D ~ ., data=trainData,
+     controls=ctree_control(minsplit=MinSplit,
+       minbucket=MinBucket,
+       maxsurrogate=MaxSurrogate,
+       maxdepth=MaxDepth))
+   # size of tree
+   print(object.size(cup.Ctree), units="auto")
+   save(cup.Ctree, file=paste("cup98-ctree-", strParameters,
+     "-run-", loopCnt, ".rdata", sep=""))
+   figTitle <- paste("Tree", loopCnt)
+   plot(cup.Ctree, main=figTitle, type="simple",
+     ip_args=list(pval=FALSE), ep_args=list(digits=0, abbreviate=TRUE),
+     tp_args=list(digits=2))
+   # print(cup.Ctree)
+   # test
+   pred <- predict(cup.Ctree, newdata=testData, type="response")
+   plot(pred, testData$TARGET_D)
+   print(sum(testData$TARGET_D[pred > cost] - cost))
+   # build donation matrices
+   # quicksort used to randomize tie values
+   s1 <- sort(pred, decreasing=TRUE, method="quick",
+     index.return=TRUE)
+   totalDonation <- cumsum(testData$TARGET_D[s1$ix]) # cumulative sum
+   avgDonation <- totalDonation / (1:testSize)
+   donationPercentile <- 100 * totalDonation / sum(testData$TARGET_D)
+   allTotalDonation[,loopCnt] <- totalDonation
+   allAvgDonation[,loopCnt] <- avgDonation
+   allDonationPercentile[,loopCnt] <- donationPercentile
+   plot(totalDonation, type="l")
+   grid()
+ }
> graphics.off()
> cat(date(), ": Loop Completed. \n\n\n")
> fnlTotalDonation <- rowMeans(allTotalDonation)
> fnlAveDonation <- rowMeans(allAvgDonation)
> fnlDonationPercentile <- rowMeans(allDonationPercentile)
```

After we sample the index to create the training and test data, we train the tree with ctree(), and save the binary-tree object as an .rdata file named with "strParameters" and the current loop number. In the pdf file, we print the tree structure, a scatter plot of the predicted versus test donations, and a cumulative test donation plot ordered by predicted donation size. The last plot examines how well the decision tree models large donations, with a steep initial increase indicating a good fit. The three result matrices are filled by column as each loop completes.
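The ordering step is the heart of the evaluation: rank test records by predicted donation, then accumulate the actual donations in that order. A minimal base-R sketch with toy values (the numbers are illustrative, not from the dataset):

```
pred   <- c(1.2, 0.4, 2.0, 0.9)  # toy predicted donations
actual <- c(10, 0, 50, 5)        # toy actual donations, same records
# sort predictions in decreasing order, keeping the original indices
s1 <- sort(pred, decreasing=TRUE, method="quick", index.return=TRUE)
s1$ix                 # records ranked by predicted amount: 3 1 4 2
cumsum(actual[s1$ix]) # cumulative actual donations: 50 60 65 65
```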

Console Output

Depending on your computer, the loop might take from a few moments to a few minutes. The output begins by printing the date, time, and parameters of the decision tree loop initialization. Once the loop begins, it prints the time at the start of each iteration with the loop counter number. It then prints the size of the binary-tree object and, on a new line, the predicted total donation profit, excluding predicted donations less than the cost ($0.68). That segment repeats 9 times (the loop number), and the finished date and time are printed in the end message.

Console Output Code:

```
> cat(date(), "\n")
Wed May 28 09:30:53 2014
> cat(" trainSize=", trainSize, ", testSize=", testSize, "\n")
 trainSize= 66788 , testSize= 28624
> cat(" MinSplit=", MinSplit, ", MinBucket=", MinBucket,
+   ", MaxSurrogate=", MaxSurrogate, ", MaxDepth=", MaxDepth,
+   "\n\n")
 MinSplit= 1000 , MinBucket= 400 , MaxSurrogate= 4 , MaxDepth= 10

Wed May 28 09:32:18 2014 : iteration = 1
25.6 Mb
[1] 3772.87
Wed May 28 09:33:27 2014 : iteration = 2
6.5 Mb
[1] 4297.76
Wed May 28 09:34:20 2014 : iteration = 3
29.2 Mb
[1] 3483.43
Wed May 28 09:35:23 2014 : iteration = 4
31.8 Mb
[1] 4056.56
Wed May 28 09:36:26 2014 : iteration = 5
24.6 Mb
[1] 4258.46
Wed May 28 09:37:27 2014 : iteration = 6
30.2 Mb
[1] 2584.28
Wed May 28 09:38:29 2014 : iteration = 7
26.1 Mb
[1] 3515.26
Wed May 28 09:39:32 2014 : iteration = 8
21.4 Mb
[1] 3706.14
Wed May 28 09:40:32 2014 : iteration = 9
34.9 Mb
[1] 4426.24
> graphics.off()
> cat(date(), ": Loop Completed. \n\n\n")
Wed May 28 09:41:44 2014 : Loop Completed.

> fnlTotalDonation <- rowMeans(allTotalDonation)
> fnlAveDonation <- rowMeans(allAvgDonation)
> fnlDonationPercentile <- rowMeans(allDonationPercentile)
> rm(trainData, testData, pred)
> results <- data.frame(cbind(allTotalDonation, fnlTotalDonation))
> names(results) <- c(paste("run", 1:LoopNum), "Average")
> write.csv(results, paste("evaluation-TotalDonation-", strParameters,
+   ".csv", sep=""))
```

The last portion of the console output covers the averages over all 9 trees. Of "fnlTotalDonation", "fnlAveDonation", and "fnlDonationPercentile", we save only "allTotalDonation" and "fnlTotalDonation": the per-run cumulative donations and their across-run average, written to a CSV file named with "strParameters".

PDF Output

Let us examine the output the code wrote to the pdf file. For our current "strParameters" we find the pdf file named "evaluation-tree-1000-400-4-10.pdf". Opening the file, we discover 9 sets of plots: a binary-tree plot, a scatter plot of predicted and test donation values, and a plot of the cumulative test donations ordered by predicted donation size.

Below I display the first set of plots of the first tree. We begin with the decision tree:

Notice that it first splits on the variable "LASTGIFT". And next is the regression plot detailing test donations and predicted values:

We see the discrete predicted values on the x-axis versus the more continuous test donation values. There is not much noticeable correlation. And lastly, the cumulative test donation plot is indexed by decreasing predicted donations:

We have a steadily increasing cumulative donation plot, indicating that the tree model did an average job of predicting high donations. The actual donations were indexed by the highest predicted donations, so actual high donations should have been plotted first on the x-axis, leading to a sharp increase in the y value at the beginning of the plot.

CSV Output

In our CSV output we saved the "allTotalDonation" and "fnlTotalDonation". The CSV file should be named "evaluation-TotalDonation-1000-400-4-10.csv". Using read.csv() to read in the file, we print out the first fifteen rows of the results. The first column is the index, followed by nine "run.x" columns, one per loop, and the average cumulative donation over all nine runs.

CSV Output Code:

```
> ## 3. interpret results ####
> results <- read.csv(paste("evaluation-TotalDonation-", strParameters,
+   ".csv", sep=""))
> results[1:15,]
    X run.1 run.2 run.3 run.4 run.5 run.6 run.7 run.8 run.9   Average
1   1     0     0     0     0     0     0     0     0     0  0.000000
2   2     0     0    10     0     0     0     0     0     0  1.111111
3   3     0     0    10     0     0     0     0     0     0  1.111111
4   4     0     0    10     0     0     0     0     0     0  1.111111
5   5     0     0    10     0     0     0     0     0    50  6.666667
6   6     0     0    10     0     0     0     0     0    50  6.666667
7   7     0     0    10     0     0     0     0     0    50  6.666667
8   8     0     0    10     0     0     0     0     0    50  6.666667
9   9     0     0    10     0     0     0     0     0    50  6.666667
10 10     0    13    10     0     0     0     0     0    50  8.111111
11 11     0    13    10     0     0     0     0     0    50  8.111111
12 12     0    13    10     0     0     0     0     0    50  8.111111
13 13     0    13    10     0     0     0     0     0    50  8.111111
14 14     0    13    10     0     0     0     0     0    50  8.111111
15 15     0    13    27     0     0     0     0     0    50 10.000000
> tail(results)
          X    run.1    run.2    run.3    run.4    run.5    run.6 run.7    run.8    run.9  Average
28619 28619 23264.82 23762.08 22536.67 23885.04 23082.46 21699.42 22479 22477.25 23217.17 22933.77
28620 28620 23284.82 23762.08 22536.67 23885.04 23082.46 21699.42 22479 22477.25 23217.17 22935.99
28621 28621 23284.82 23762.08 22536.67 23885.04 23082.46 21699.42 22479 22477.25 23217.17 22935.99
28622 28622 23284.82 23762.08 22536.67 23885.04 23082.46 21699.42 22504 22477.25 23217.17 22938.77
28623 28623 23284.82 23762.08 22536.67 23885.04 23082.46 21699.42 22504 22477.25 23247.17 22942.10
28624 28624 23284.82 23762.08 22536.67 23885.04 23082.46 21699.42 22504 22477.25 23247.17 22942.10
```

Next we print out the last six rows with tail(), which displays the cumulative results of each run and the average cumulative donation (not including cost). The average total cumulative donation summed to $22,942.10. Among the runs, totals ranged from $21,699.42 for run six to $23,885.04 for run four.
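The per-run totals quoted above are simply the last row of the cumulative matrix. A toy sketch of extracting and ranging over them (the matrix values are illustrative, not from the dataset):

```
# rows = ranked test records, columns = runs (toy cumulative totals)
allTotal <- cbind(run1 = c(10, 30, 55),
                  run2 = c(12, 25, 48))
finals <- allTotal[nrow(allTotal), ]  # final total of each run
finals
range(finals)                         # worst and best run
```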

Again, this is turning into a lengthy post. After running the loop for one "strParameters" set, we have the predicted donation data. Since this is an optimization problem, we will later run different "strParameters" sets to determine which set of models predicts the highest donation return. The next post will explore the donation results data with graphs, in preparation for selecting the best set of decision trees. Stay tuned for more posts!

Wayne
@beyondvalence

More:
1. KDD Cup: Profit Optimization in R Part 1: Exploring Data
2. KDD Cup: Profit Optimization in R Part 2: Decision Trees
3. KDD Cup: Profit Optimization in R Part 3: Visualizing Results
4. KDD Cup: Profit Optimization in R Part 4: Selecting Trees
5. KDD Cup: Profit Optimization in R Part 5: Evaluation

Saturday, May 24, 2014

KDD Cup: Profit Optimization in R Part 1: Exploring Data

Today we begin a case study on predicting and optimizing customer response and maximizing donations. The data were obtained from the Knowledge Discovery and Data Mining (KDD) Cup's 1998 competition. KDD is a Special Interest Group of the Association for Computing Machinery (ACM). Here is a link to the annual KDD conference in 2014, titled "Data Mining for Social Good", in NYC.

 KDD Logo

We will train decision trees to predict which customers donate the most, and optimize who to send the mail-in orders. Here we shall use the party library in R. The KDD data can be downloaded here. After you decompress the files, you will need cup98LRN.txt (training set) and cup98VAL.txt (test set).

Part 1

In this first post, we will explore the data, and visualize the distributions of the target variable and its relationship with other variables. In Part 2 I will cover building the decision trees, next in Part 3 I shall visualize the donation results, and in Part 4 I will finish by talking about selecting the best tree to maximize donations.

The 1998 KDD Data

First off, this is a large learning dataset with 95,412 rows and 481 variables at 117 MB uncompressed, and the test set is of similar dimension and size. Therefore we have to explore the data, determine which variables we want to use to predict donations, and locate the donation amount variable. Note that there is a data dictionary detailing all the variables in the information section of the data tab.

To grasp what data we are dealing with, we use the describe() function from the Hmisc library. As you can see, along with the mean, unique and missing values, median, and quantiles, it also displays the highest and lowest values. We pass the first 28 variables to describe(), which covers the demographic data. The output is quite lengthy, so if you want to run the other sections of variables, you can uncomment them.

Describing Demographics Code:

```
# learning dataset
# 95,412 records, 481 fields
# 2 target variables
# load learning data ###
> cup98 <- read.csv("cup98LRN.txt", sep=",")
## 1. detailed description of the variables ####
# target variables
## TARGET_B (y-n donation) and TARGET_D (donation amount)
> library(Hmisc)
# describe(cup98[,1:28])    # demographics
# describe(cup98[,29:42])   # response to other types of mail orders
# describe(cup98[,43:55])   # overlay data
# describe(cup98[,56:74])   # donor interests
# describe(cup98[,75])      # PEP star RFA status
# describe(cup98[,76:361])  # characteristics of donor neighborhood
# describe(cup98[,362:407]) # promotion history
# describe(cup98[,408:412]) # summary variables of promotion history
# describe(cup98[,413:456]) # giving history
# describe(cup98[,457:469]) # summary variables of giving history
# describe(cup98[,470:473]) ## ID & TARGETS
# describe(cup98[,474:479]) # RFA (recency-frequency-donation amount)
# describe(cup98[,480:481]) # cluster & geocode
# names(cup98)
> describe(cup98[,1:28]) # demographics
cup98[, 1:28]

 28 Variables  95412 Observations
---------------------------------------------------------------------
ODATEDW
      n missing unique Mean  .05  .10  .25  .50  .75  .90  .95
  95412       0     54 9141 8601 8601 8801 9201 9501 9601 9601

lowest : 8306 8401 8501 8601 8604, highest: 9510 9511 9512 9601 9701
---------------------------------------------------------------------
OSOURCE
      n missing unique
  95412       0    896

lowest : AAA AAD AAM ABC, highest: YAN YKA YKD YNF ZOY
---------------------------------------------------------------------
TCODE
      n missing unique  Mean .05 .10 .25 .50 .75 .90 .95
  95412       0     55 54.22   0   0   0   1   2  28  28

lowest : 0 1 2 3 4, highest: 24002 28028 39002 58002 72002
---------------------------------------------------------------------
STATE
      n missing unique
  95412       0     57

lowest : AA AE AK AL AP, highest: VT WA WI WV WY
---------------------------------------------------------------------
ZIP
      n missing unique
  95412       0  19938

lowest : 00801 00802 00820 00821 00840
highest: 99901- 99925 99928 99928- 99950
---------------------------------------------------------------------
MAILCODE
      n missing unique
  95412       0      2

 (94013, 99%), B (1399, 1%)
---------------------------------------------------------------------
PVASTATE
      n missing unique
  95412       0      3

 (93954, 98%), E (5, 0%), P (1453, 2%)
---------------------------------------------------------------------
DOB
      n missing unique Mean .05 .10 .25  .50  .75  .90  .95
  95412       0    947 2724   0   0 201 2610 4601 5605 6201

lowest : 0 1 2 4 5, highest: 9701 9704 9706 9708 9710
---------------------------------------------------------------------
NOEXCH
      n missing unique
  95412       0      4

 (7, 0%), 0 (95085, 100%), 1 (285, 0%), X (35, 0%)
---------------------------------------------------------------------
RECINHSE
      n missing unique
  95412       0      2

 (88709, 93%), X (6703, 7%)
---------------------------------------------------------------------
RECP3
      n missing unique
  95412       0      2

 (93395, 98%), X (2017, 2%)
---------------------------------------------------------------------
RECPGVG
      n missing unique
  95412       0      2

 (95298, 100%), X (114, 0%)
---------------------------------------------------------------------
RECSWEEP
      n missing unique
  95412       0      2

 (93795, 98%), X (1617, 2%)
---------------------------------------------------------------------
MDMAUD
      n missing unique
  95412       0     28

lowest : C1CM C1LM C1MM C2CM C2LM, highest: L1MM L2CM L2LM L2TM XXXX
---------------------------------------------------------------------
DOMAIN
      n missing unique
  95412       0     17

                C1   C2   C3   R1    R2   R3    S1   S2   S3   T1    T2   T3   U1   U2   U3
Frequency 2316 6145 8264 5280 1358 13623 4809 11503 8530 1891 4982 12369 2176 4510 3254 2598
%            2    6    9    6    1    14    5    12    9    2     5    13    2    5    3    3

            U4
Frequency 1804
%            2
---------------------------------------------------------------------
CLUSTER
      n missing unique  Mean .05 .10 .25 .50 .75 .90 .95
  93096    2316     53 27.92   4   8  15  28  40  47  50

lowest : 1 2 3 4 5, highest: 49 50 51 52 53
---------------------------------------------------------------------
AGE
      n missing unique  Mean  .05  .10  .25  .50  .75  .90  .95
  71747   23665     96 61.61 34.3 39.0 48.0 62.0 75.0 83.0 87.0

lowest : 1 2 3 4 6, highest: 94 95 96 97 98
---------------------------------------------------------------------
AGEFLAG
      n missing unique
  95412       0      3

 (29548, 31%), E (57344, 60%), I (8520, 9%)
---------------------------------------------------------------------
HOMEOWNR
      n missing unique
  95412       0      3

 (22228, 23%), H (52354, 55%), U (20830, 22%)
---------------------------------------------------------------------
CHILD03
      n missing unique
  95412       0      4

 (94266, 99%), B (40, 0%), F (237, 0%), M (869, 1%)
---------------------------------------------------------------------
CHILD07
      n missing unique
  95412       0      4

 (93846, 98%), B (97, 0%), F (408, 0%), M (1061, 1%)
---------------------------------------------------------------------
CHILD12
      n missing unique
  95412       0      4

 (93601, 98%), B (142, 0%), F (520, 1%), M (1149, 1%)
---------------------------------------------------------------------
CHILD18
      n missing unique
  95412       0      4

 (92565, 97%), B (263, 0%), F (1142, 1%), M (1442, 2%)
---------------------------------------------------------------------
NUMCHLD
      n missing unique  Mean
  12386   83026      7 1.528

             1    2    3   4  5 6 7
Frequency 7792 3110 1101 316 59 7 1
%           63   25    9   3  0 0 0
---------------------------------------------------------------------
INCOME
      n missing unique  Mean
  74126   21286      7 3.886

             1     2    3     4     5    6    7
Frequency 9022 13114 8558 12732 15451 7778 7471
%           12    18   12    17    21   10   10
---------------------------------------------------------------------
GENDER
      n missing unique
  95412       0      7

               A C     F   J     M    U
Frequency 2957 2 2 51277 365 39094 1715
%            3 0 0    54   0    41    2
---------------------------------------------------------------------
WEALTH1
      n missing unique  Mean .05 .10 .25 .50 .75 .90 .95
  50680   44732     10 5.346   1   1   3   6   8   9   9

             0    1    2    3    4    5    6    7    8    9
Frequency 2413 3454 4085 4237 4810 5280 5825 6198 6793 7585
%            5    7    8    8    9   10   11   12   13   15
---------------------------------------------------------------------
HIT
      n missing unique  Mean .05 .10 .25 .50 .75 .90 .95
  95412       0     75 3.321   0   0   0   0   3  11  17

lowest : 0 1 2 3 4, highest: 75 79 84 240 241
---------------------------------------------------------------------
```

Our focus on the targets leads us to variables 470 to 473. They include CONTROLN, the unique ID number; TARGET_B, a binary donation indicator; TARGET_D, the donation amount in USD; and HPHONE_D, a binary home phone number indicator.

Target: Donations

Let us take a look at the distribution of the donation variable, TARGET_D. From the describe() function we see the majority of the values are small, with 95% being 3 dollars or less (most records did not donate at all). However, in the extreme values, we see 100+ dollar donations, peaking at $200.

Describing Donations Code:
```
> describe(cup98[,472])
cup98[, 472]
      n missing unique   Mean .05 .10 .25 .50 .75 .90 .95
  95412       0     71 0.7931   0   0   0   0   0   0   3

lowest :   0.0   1.0   2.0   2.5   3.0
highest: 100.0 101.0 102.0 150.0 200.0
```

We can isolate the positive donations in cup98pos by using TARGET_B > 0, and visualize the modified donation distribution using a box plot. This way we only deal with those rows that donated some amount.

Plotting Donations Code:
```
> # positive donations - boxplot, description ####
> cup98pos <- cup98[cup98$TARGET_B > 0,]
> donations <- cup98pos$TARGET_D
> summary(donations)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00   10.00   13.00   15.62   20.00  200.00 
> boxplot(donations, main="Positive Donations")
```

The middle 50% of the values (between the 25th and 75th percentiles) lie between 10 and 20 dollars, so among those who donated, most gave $20 or less. However, there were a few generous donations of $100 to $200. The boxplot offers a fast way to view distributions. Another option is to cut() the donations into intervals and count how many rows fall into each interval.

We split the interval into multiples of 5, with the lowest from 0 to 0.1, and the highest covering 50-200. Observing the table of the donation intervals, we see the majority (n=90569) did not donate anything, while 110 donations fell into the highest category, 50-200 dollars. We plot both the donation intervals and the positive donation intervals to examine the distribution more closely.

Cut and Plot Code:
```
# create level factor donation variable ####
# [a,b), closed on left, open on right: (a <= x < b)
> cup98$TARGET_D2 <- cut(cup98$TARGET_D, right=FALSE,
+   breaks=c(0, 0.1, 10, 15, 20, 25, 30, 50, max(cup98$TARGET_D)))
> cup98pos$TARGET_D2 <- cut(cup98pos$TARGET_D, right=FALSE,
+   breaks=c(0, 0.1, 10, 15, 20, 25, 30, 50, max(cup98pos$TARGET_D)))
> layout(matrix(c(1,2), 2, 1))
> plot(table(cup98$TARGET_D2), main="All Donations",
+   xlab="Dollar Amount", ylab="Number of Donations")
> plot(table(cup98pos$TARGET_D2), main="Positive Donations",
+   xlab="Dollar Amount", ylab="Number of Donations")
> layout(1)
> table(cup98$TARGET_D2)

 [0,0.1) [0.1,10)  [10,15)  [15,20)  [20,25)  [25,30)  [30,50) [50,200) 
   90569     1132     1378      806      745      435      233      110 
```

The lower plot shows only those who donated; its smaller y-axis limit lets you see the bars more clearly at the lower scale. The most frequent donation intervals are [0.1,10) and [10,15). Remember that square brackets [ ] are closed and include the endpoint, while parentheses ( ) are open and exclude it. So the interval [10,15) covers 10 up to, but not including, 15.
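The closed-left, open-right behavior of cut() with right=FALSE is easy to verify on a toy vector (the values below are illustrative, not from the dataset):

```
x <- c(0, 5, 10, 14.9, 15, 49, 50)
breaks <- c(0, 0.1, 10, 15, 20, 25, 30, 50, 200)
# right=FALSE gives [a,b) intervals: a is included, b is not
table(cut(x, right=FALSE, breaks=breaks))
# 14.9 falls in [10,15), but 15 itself moves up to [15,20)
```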

Additionally, we can plot the discrete values from the table() function, which outputs a named vector of the number of donations at each unique donation value. This way we can visualize the number of donations for each dollar amount, instead of relying on intervals as we did above.
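table() on a numeric vector counts each distinct value, which is exactly what the line plot draws. A toy sketch (the amounts are illustrative, not from the dataset):

```
d <- c(10, 10, 5, 25, 10, 5)  # toy donation amounts
table(d)                      # counts: 5 twice, 10 three times, 25 once
plot(table(d), type="l")      # one spike per distinct amount
```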

Plot Code:
```
# discrete donation plot
> plot(table(cup98pos$TARGET_D), type="l", ylim=c(0,1000),
+   xlab="Donation Amount", ylab="Number of Donations",
+   main="Donation Distribution")
```

The tick marks on the x-axis denote actual donated amounts, and the y-axis shows the number of donations at each amount. Observe that the majority of the donations are below $27, while the four highest peaks exceed 400 donations; the fifth comes in close at 392, at the $25 amount. The distribution is heavily skewed to the right, with the majority of the values small and a long tail of large donations to the right.

Variable Selection

Because there are a large number of variables from which to select, going through each might be a hassle. Luckily, we have a data dictionary describing each variable. We will select 67 of the 481 variables available from the cup98 data. They include demographic variables, donor interests, promotion history, giving history, ID & TARGETs, and recency-frequency-donation amounts. Of course you can add other variables you think will be vital in predicting donation amounts.

Variable Selection Code:
```
> # select variables ####
> varSet <- c(
+   # demographics
+   "ODATEDW", "OSOURCE", "STATE", "ZIP", "PVASTATE", "DOB",
+   "RECINHSE", "MDMAUD", "DOMAIN", "CLUSTER", "AGE", "HOMEOWNR",
+   "CHILD03", "CHILD07", "CHILD12", "CHILD18", "NUMCHLD",
+   "INCOME", "GENDER", "WEALTH1", "HIT",
+   # donor interests
+   "COLLECT1", "VETERANS", "BIBLE", "CATLG", "HOMEE", "PETS",
+   "CDPLAY", "STEREO", "PCOWNERS", "PHOTO", "CRAFTS", "FISHER",
+   "GARDENIN", "BOATS", "WALKER", "KIDSTUFF", "CARDS", "PLATES",
+   # PEP star RFA status
+   "PEPSTRFL",
+   # summary variables of promotion history
+   "CARDPROM", "MAXADATE", "NUMPROM", "CARDPM12", "NUMPRM12",
+   # summary variables of giving history
+   "RAMNTALL", "NGIFTALL", "CARDGIFT", "MINRAMNT", "MAXRAMNT",
+   "LASTGIFT", "LASTDATE", "FISTDATE", "TIMELAG", "AVGGIFT",
+   # ID & targets
+   "CONTROLN", "TARGET_B", "TARGET_D", "TARGET_D2", "HPHONE_D",
+   # RFA (Recency/Frequency/Donation Amount)
+   "RFA_2F", "RFA_2A", "MDMAUD_R", "MDMAUD_F", "MDMAUD_A",
+   # others
+   "CLUSTER2", "GEOCODE2")
>
> # create new cup98 set ####
> cup98 <- cup98[, varSet]
```

Since we have created the new cup98 dataset, we are now ready to create the decision trees. But before we move on, let us explore some of the predictor variables included in the new cup98 data.

Variable Distribution

First, let us start with exploring age and donation amount. We will create a new positive donation dataset from the new cup98 data, and cut() the AGE variable into increments of 5. Then we will plot the donation stratified by the new age intervals to examine how each age grouping donated. Note that we restrict the donation plotted on the y-axis to a maximum of \$40.

Donation Age Distribution Code:
```
> # check distribution of donation in various age groups ####
> cup98pos <- cup98[cup98$TARGET_D > 0, ]
> age2 <- cut(cup98pos$AGE, right=FALSE, breaks=seq(0, 100, by=5))
> table(age2)
> boxplot(cup98pos$TARGET_D ~ age2, ylim=c(0, 40), las=3,
+   main="Donation Age Distribution")
> # people aged 30..60 have higher median donation; in workforce
```

Examining the boxplots above, we see that the intervals from 30 to 60 have the highest median donation amounts. For practical purposes, donations begin in the 15 to 20 age group and continue all the way to the 95 to 100 group. A major reason why people aged 30 to 60 donate the higher amounts can be attributed to their workforce status: people at those ages are most likely to have a job, and therefore have the disposable income to donate.
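Rather than reading the medians off the boxplots, we can compute them directly. A small sketch, assuming the cup98pos data frame and the age2 intervals created in the code above:

```r
# median donation for each 5-year age interval
med.by.age <- tapply(cup98pos$TARGET_D, age2, median, na.rm=TRUE)
round(med.by.age, 2)
```

If the boxplot reading is right, the medians should peak in the 30-to-60 intervals.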

Next, we can look at the gender of donors with positive donations. We will focus on female, male, and joint account donors. This time, with fewer categories, we can use a density plot to overlay the female, male, and joint curves. Note that we attach() cup98pos to add it to R's search path, so we do not have to type the dataset name over and over. Again we restrict the donation amount to the lower donation band, but this time from 0 to 60 to capture the majority of the donations.

Donation-Gender Distribution Code:
```
> # check distribution of donation for gender ####
> attach(cup98pos)
> # density plot for female, male, and joint donations
> plot(density(TARGET_D[GENDER=="F"]), xlim=c(0, 60), col=1, lty=1,
+   main="Donation Density Plot for Gender")
> lines(density(TARGET_D[GENDER=="M"]), col=2, lty=2)
> lines(density(TARGET_D[GENDER=="J"]), col=3, lty=3)
> legend("topright", c("Female", "Male", "Joint account"),
+   col=1:3, lty=1:3)
> detach(cup98pos)
```

Observe the three different colored and patterned lines for the female, male, and joint account densities. All three categories share a peak at 10 dollars, and the female and male curves are nearly identical. However, joint account donations lack the secondary peaks at the 5, 15, 20, and 25 dollar amounts. Notice that the donation amounts cluster mainly at multiples of 5, which likely correspond to the suggested donation options on the mail-in order.
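We can check the multiples-of-5 pattern directly by tabulating the most common donation amounts. A quick sketch, assuming the cup98pos data frame from earlier:

```r
# most frequent donation amounts among positive donors
head(sort(table(cup98pos$TARGET_D), decreasing=TRUE), 10)
```

If the pattern holds, amounts such as 5, 10, 15, 20, and 25 should dominate the top of this table.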

Variable Correlations

We can also check the correlations between the target donation variable and the other numeric variables. First we create an index of the numeric columns, then correlate the target donation variable with those columns of cup98. To make comparison easier, we take the absolute value of each correlation and sort in decreasing order, so the most strongly correlated variables appear first.

Correlation Code:
```
> # check correlation between target & numeric vars ####
> num.idx <- which(sapply(cup98, is.numeric))
> correlations <- cor(cup98$TARGET_D, cup98[, num.idx],
+   use="pairwise.complete.obs")
> correlations <- abs(correlations)
> (correlations <- correlations[, order(correlations, decreasing=TRUE)])
    TARGET_D     TARGET_B     LASTGIFT     RAMNTALL      AVGGIFT
1.0000000000 0.7742323755 0.0616784458 0.0448101061 0.0442990841
    MAXRAMNT       INCOME     CLUSTER2     NUMPRM12      WEALTH1
0.0392237509 0.0320627023 0.0290870830 0.0251337775 0.0248673117
    MINRAMNT     LASTDATE      NUMPROM      CLUSTER     CARDPM12
0.0201578686 0.0188471021 0.0173371740 0.0171274879 0.0163577542
     NUMCHLD     CONTROLN     CARDPROM     FISTDATE      ODATEDW
0.0149204899 0.0133664439 0.0113023931 0.0075324932 0.0069484311
         HIT     CARDGIFT     NGIFTALL     MAXADATE      TIMELAG
0.0066483728 0.0064498822 0.0048990126 0.0044963520 0.0036115917
         DOB     HPHONE_D          AGE       RFA_2F
0.0027541472 0.0024315898 0.0022823598 0.0009047682
```

Of course TARGET_D correlates perfectly with itself, and TARGET_B correlates highly as well, being a yes/no indicator of donation. The next variables with higher correlations are LASTGIFT, RAMNTALL, and AVGGIFT, all three measures of previous giving history. This indicates (with common sense) that previous history is likely to be a strong predictor of the current donation amount.
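Even the strongest of these correlations are small in absolute terms, so it can be worth checking whether one such as LASTGIFT's is statistically distinguishable from zero. A sketch using base R's cor.test(), assuming the cup98 data frame from above (we filter to complete cases first, since cor.test() does not take a use= argument):

```r
# test the correlation between donation amount and last gift size
ok <- complete.cases(cup98$TARGET_D, cup98$LASTGIFT)
ct <- cor.test(cup98$TARGET_D[ok], cup98$LASTGIFT[ok])
ct$estimate   # the Pearson correlation
ct$p.value    # with ~95,000 records, even weak correlations test as significant
```

With a sample this large, nearly everything is "significant", so the magnitude of the correlation matters more than the p-value when judging predictive usefulness.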

We can create a scatter plot of the variable HIT, which counts mail-order responses, against AGE. This gives us an idea of how many responses were received at each age. We further stratify the plot by donation status, changing the point shape and color depending on whether the donation was non-zero.

Plotting Donation Responses Code:
```
> # scatter plot for donation response and age ####
> color <- ifelse(cup98$TARGET_D > 0, "blue", "black")
> pch <- ifelse(cup98$TARGET_D > 0, "+", ".")
> plot(jitter(cup98$AGE), jitter(cup98$HIT), pch=pch, col=color,
+   cex=0.7, ylim=c(0, 70), xlab="AGE", ylab="HIT",
+   main="Mail Order Responses By Age and Donation Status")
> legend("topleft", c("TARGET_D>0", "TARGET_D=0"),
+   col=c("blue", "black"), pch=c("+", "."))
```

As we inspect the plot above, we notice that most mail-order responses come from older people, ages 60 to 80. However, the people who actually donated tended to be those with fewer responses: although many people sent back the mail-in order, most returned orders did not include a donation, while the donations concentrated among people with fewer responses. It appears that when people want to donate, they do so within their first few responses.
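We can quantify this impression by computing the donation rate within bands of HIT. A sketch, assuming the cup98 data frame from above (the band boundaries here are an arbitrary choice for illustration):

```r
# donation rate by number of mail-order responses
hit.band <- cut(cup98$HIT, breaks=c(-1, 0, 5, 20, 50, Inf),
                labels=c("0", "1-5", "6-20", "21-50", "50+"))
round(tapply(cup98$TARGET_D > 0, hit.band, mean), 4)
```

If the reading of the scatter plot is correct, the donation rate should fall off in the higher-response bands.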

This post is getting lengthy already, so I will stop going through variables here, and continue in the next post with building the decision trees to predict donations. Stay tuned for Part 3 of this Profit Optimization Case Study!