Hello Readers,
Today we begin a case study on predicting and optimizing customer response and maximizing donations. The data were obtained from the Knowledge Discovery in Data (KDD) Cup's 1998 competition. KDD is a Special Interest Group of the Association of Computing Machinery (ACM). Here is a link to the annual KDD conference in 2014 titled "Data Mining for Social Good" in NYC.
KDD Logo |
We will train decision trees to predict which customers donate the most, and optimize who to send the mail-in orders. Here we shall use the party library in R. The KDD data can be downloaded here. After you decompress the files, you will need cup98LRN.txt (training set) and cup98VAL.txt (test set).
Part 1
In this first post, we will explore the data, and visualize the distributions of the target variable and its relationship with other variables. In Part 2 I will cover building the decision trees, next in Part 3 I shall visualize the donation results, and in Part 4 I will finish by talking about selecting the best tree to maximize donations.
The 1998 KDD Data
First off, this is a large learning dataset with 95,412 rows and 481 variables at 117 MB uncompressed. Likewise, the test set is of similar dimension and size. Therefore we would have to explore the data and determine which variables we want to use to predict donations, and locate the donation amount variable. Note that there is a data dictionary detailing all the variables in the information section of the data tab.
To grasp what data we are dealing with, we use the describe() function from the Hmisc library. As you can see, along with mean, unique and missing values, median, and quantiles, it also displays the highest and lowest values. We pass the first 28 variables to describe(), which covers the demographic data. The output is quite lengthy, so if you want to run the other section of variables, you can uncomment them.
Describing Demographics Code:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 | # learning dataset # 95,412 records, 481 fields # 2 target variables # load learning data ### > cup98 <- read.csv("cup98LRN.txt", sep=",") ## 1. detailed description of the variables #### # target variables ## TARGET_B (y-n donation) and TARGET_D (donation amount) > library(Hmisc) # describe(cup98[,1:28]) # demographics # describe(cup98[,29:42]) # response to other types of mail orders # describe(cup98[,43:55]) # overlay data # describe(cup98[,56:74]) # donor interests # describe(cup98[,75]) # PEP star RFA status # describe(cup98[,76:361]) # characteristics of donor neighborhood # describe(cup98[,362:407]) # promotion history # describe(cup98[,408:412]) # summary variables of promotion history # describe(cup98[,413:456]) # giving history # describe(cup98[,457:469]) # summary variables of giving history # describe(cup98[,470:473]) ## ID & TARGETS # describe(cup98[,474:479]) # RFA (recency-frequency-donation amount) # describe(cup98[,480:481]) # cluster & geocode # names(cup98) > describe(cup98[,1:28]) # demographics cup98[, 1:28] 28 Variables 95412 Observations ----------------------------------------------------------------------------------------------- ODATEDW n missing unique Mean .05 .10 .25 .50 .75 .90 .95 95412 0 54 9141 8601 8601 8801 9201 9501 9601 9601 lowest : 8306 8401 8501 8601 8604, highest: 9510 9511 9512 9601 9701 ----------------------------------------------------------------------------------------------- OSOURCE n missing unique 95412 0 896 lowest : AAA AAD AAM ABC, highest: YAN YKA YKD YNF ZOY ----------------------------------------------------------------------------------------------- TCODE n missing unique Mean .05 .10 .25 .50 .75 .90 .95 95412 0 55 54.22 0 0 0 1 2 28 28 lowest : 0 1 2 3 4, highest: 24002 28028 39002 58002 72002 ----------------------------------------------------------------------------------------------- STATE n missing unique 95412 0 57 lowest : AA AE AK AL AP, highest: VT WA WI WV WY ----------------------------------------------------------------------------------------------- ZIP n missing unique 95412 0 19938 lowest : 00801 00802 00820 00821 00840 , highest: 99901- 99925 99928 99928- 99950 ----------------------------------------------------------------------------------------------- MAILCODE n missing unique 95412 0 2 (94013, 99%), B (1399, 1%) ----------------------------------------------------------------------------------------------- PVASTATE n missing unique 95412 0 3 (93954, 98%), E (5, 0%), P (1453, 2%) ----------------------------------------------------------------------------------------------- DOB n missing unique Mean .05 .10 .25 .50 .75 .90 .95 95412 0 947 2724 0 0 201 2610 4601 5605 6201 lowest : 0 1 2 4 5, highest: 9701 9704 9706 9708 9710 ----------------------------------------------------------------------------------------------- NOEXCH n missing unique 95412 0 4 (7, 0%), 0 (95085, 100%), 1 (285, 0%), X (35, 0%) ----------------------------------------------------------------------------------------------- RECINHSE n missing unique 95412 0 2 (88709, 93%), X (6703, 7%) ----------------------------------------------------------------------------------------------- RECP3 n missing unique 95412 0 2 (93395, 98%), X (2017, 2%) ----------------------------------------------------------------------------------------------- RECPGVG n missing unique 95412 0 2 (95298, 100%), X (114, 0%) ----------------------------------------------------------------------------------------------- RECSWEEP n missing unique 95412 0 2 (93795, 98%), X (1617, 2%) ----------------------------------------------------------------------------------------------- MDMAUD n missing unique 95412 0 28 lowest : C1CM C1LM C1MM C2CM C2LM, highest: L1MM L2CM L2LM L2TM XXXX ----------------------------------------------------------------------------------------------- DOMAIN n missing unique 95412 0 17 C1 C2 C3 R1 R2 R3 S1 S2 S3 T1 T2 T3 U1 U2 U3 Frequency 2316 6145 8264 5280 1358 13623 4809 11503 8530 1891 4982 12369 2176 4510 3254 2598 % 2 6 9 6 1 14 5 12 9 2 5 13 2 5 3 3 U4 Frequency 1804 % 2 ----------------------------------------------------------------------------------------------- CLUSTER n missing unique Mean .05 .10 .25 .50 .75 .90 .95 93096 2316 53 27.92 4 8 15 28 40 47 50 lowest : 1 2 3 4 5, highest: 49 50 51 52 53 ----------------------------------------------------------------------------------------------- AGE n missing unique Mean .05 .10 .25 .50 .75 .90 .95 71747 23665 96 61.61 34.3 39.0 48.0 62.0 75.0 83.0 87.0 lowest : 1 2 3 4 6, highest: 94 95 96 97 98 ----------------------------------------------------------------------------------------------- AGEFLAG n missing unique 95412 0 3 (29548, 31%), E (57344, 60%), I (8520, 9%) ----------------------------------------------------------------------------------------------- HOMEOWNR n missing unique 95412 0 3 (22228, 23%), H (52354, 55%), U (20830, 22%) ----------------------------------------------------------------------------------------------- CHILD03 n missing unique 95412 0 4 (94266, 99%), B (40, 0%), F (237, 0%), M (869, 1%) ----------------------------------------------------------------------------------------------- CHILD07 n missing unique 95412 0 4 (93846, 98%), B (97, 0%), F (408, 0%), M (1061, 1%) ----------------------------------------------------------------------------------------------- CHILD12 n missing unique 95412 0 4 (93601, 98%), B (142, 0%), F (520, 1%), M (1149, 1%) ----------------------------------------------------------------------------------------------- CHILD18 n missing unique 95412 0 4 (92565, 97%), B (263, 0%), F (1142, 1%), M (1442, 2%) ----------------------------------------------------------------------------------------------- NUMCHLD n missing unique Mean 12386 83026 7 1.528 1 2 3 4 5 6 7 Frequency 7792 3110 1101 316 59 7 1 % 63 25 9 3 0 0 0 ----------------------------------------------------------------------------------------------- INCOME n missing unique Mean 74126 21286 7 3.886 1 2 3 4 5 6 7 Frequency 9022 13114 8558 12732 15451 7778 7471 % 12 18 12 17 21 10 10 ----------------------------------------------------------------------------------------------- GENDER n missing unique 95412 0 7 A C F J M U Frequency 2957 2 2 51277 365 39094 1715 % 3 0 0 54 0 41 2 ----------------------------------------------------------------------------------------------- WEALTH1 n missing unique Mean .05 .10 .25 .50 .75 .90 .95 50680 44732 10 5.346 1 1 3 6 8 9 9 0 1 2 3 4 5 6 7 8 9 Frequency 2413 3454 4085 4237 4810 5280 5825 6198 6793 7585 % 5 7 8 8 9 10 11 12 13 15 ----------------------------------------------------------------------------------------------- HIT n missing unique Mean .05 .10 .25 .50 .75 .90 .95 95412 0 75 3.321 0 0 0 0 3 11 17 lowest : 0 1 2 3 4, highest: 75 79 84 240 241 ----------------------------------------------------------------------------------------------- |
Our focus on the targets lead us to variables 470 to 473. They include the CONTROLN as the unique id number, TARGET_B as a binary donation indicator, TARGET_D as the donation amount in USD, HPHONE_D as a binary home phone number indicator.
Target: Donations
Let us take a look at the distribution of the donation variable, TARGET_D. From the describe() function we see the majority of the donations are small denominations, with 95% being 3 dollars or less. However, in the extreme values, we see 100+ dollar donations, peaking at $200.
Describing Donations Code:
1 2 3 4 5 6 7 8 9 | > describe(cup98[,472]) cup98[, 472] n missing unique Mean .05 .10 .25 .50 .75 95412 0 71 0.7931 0 0 0 0 0 .90 .95 0 3 lowest : 0.0 1.0 2.0 2.5 3.0 highest: 100.0 101.0 102.0 150.0 200.0 |
We can isolate the positive donations in cup98pos by using TARGET_B > 0, and visualize the modified donation distribution using a box plot. This way we only deal with those rows that donated some amount.
Plotting Donations Code:
1 2 3 4 5 6 7 | > # positive donations- boxplot, description #### > cup98pos <- cup98[cup98$TARGET_B > 0,] > donations <- cup98pos$TARGET_D > summary(donations) Min. 1st Qu. Median Mean 3rd Qu. Max. 1.00 10.00 13.00 15.62 20.00 200.00 > boxplot(donations, main="Positive Donations") |
A large portion- between 25th and 75th percentiles- 50% of the values lie at or below 20 dollars. So we can conclude among those who donated, they usually donated $20 or less. However there were a few generous donations of $100-$200. The boxplot offers a fast way to view distributions. Another option is to cut() the donations into intervals and see how many rows fall into each interval.
We split the interval into multiples of 5, with the lowest from 0 to 0.1, and the highest covering 50-200. Observing the table of the donation intervals, we see the majority (n=90569) did not donate anything, while 110 donations fell into the highest category, 50-200 dollars. We plot both the donation intervals and the positive donation intervals to examine the distribution more closely.
Cut and Plot Code:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 | # create level factor donation variable #### # [a,b), closed on left, open on right: (a <= x < b) > cup98$TARGET_D2 <- cut(cup98$TARGET_D, right=FALSE, breaks=c(0,0.1,10,15,20,25,30,50,max(cup98$TARGET_D))) > cup98pos$TARGET_D2 <- cut(cup98pos$TARGET_D, right=FALSE, breaks=c(0,0.1,10,15,20,25,30,50,max(cup98pos$TARGET_D))) > layout(matrix(c(1,2),2,1)) > plot(table(cup98$TARGET_D2), main="All Donations", xlab="Dollar Amount", ylab="Numer of Donations") > plot(table(cup98pos$TARGET_D2), main="Positive Donations", xlab="Dollar Amount", ylab="Numer of Donations") > layout(1) > table(cup98$TARGET_D2) [0,0.1) [0.1,10) [10,15) [15,20) [20,25) [25,30) [30,50) [50,200) 90569 1132 1378 806 745 435 233 110 |
In the lower plot you notice the difference in only with those who donated due to the small y-axis limit so you can see the bars more clearly in the lower scale. The most frequent donation intervals include [0.1,10) and [10,15). Remember that [ ] are closed which include the number and ( ) are open, which do not include the number. So the interval [10,15) describes the interval 10 and up to, but not including 15.
Additionally, we can plot the discrete values piped from the table() function, which outputs a named vector of the number of donations for each unique donation value. This way we can visualize the number of donations for each dollar amount, instead of relying on intervals, as we did above.
Plot Code:
1 2 3 4 | # discrete donation plot > plot(table(cup98pos$TARGET_D), type="l", ylim=c(0,1000), + xlab="Donation Amount", ylab="Number of Donations", + main="Donation Distribution") |
The tick marks on the x-axis denote actual donated amounts, and the y-axis show the number of donations at that donation amount. Observe the majority of the donations are below $27, while the highest 4 peaks are above 400- the 5th comes in close at 392, which is the $25 amount. The distribution is heavily skewed to the right, with the majority of the values less than the large extreme to the right.
Variable Selection
Because there are a large number of variables from which to select, going through each might be a hassle. Luckily, we have a data dictionary describing each variable. We will select 67 of the 481 variables available from the cup98 data. They include demographic variables, donor interests, promotion history, giving history, ID & TARGETs, and recency-frequency-donation amounts. Of course you can add other variables you think will be vital in predicting donation amounts.
Variable Selection Code:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 | > # select variables #### > varSet <- c( + # demographics 160 R and Data Mining + "ODATEDW", "OSOURCE", "STATE", "ZIP", "PVASTATE", "DOB", + "RECINHSE", "MDMAUD", "DOMAIN", "CLUSTER", "AGE", "HOMEOWNR", + "CHILD03", "CHILD07", "CHILD12", "CHILD18", "NUMCHLD", + "INCOME", "GENDER", "WEALTH1", "HIT", + # donor interests + "COLLECT1", "VETERANS", "BIBLE", "CATLG", "HOMEE", "PETS", + "CDPLAY", "STEREO", "PCOWNERS", "PHOTO", "CRAFTS", "FISHER", + "GARDENIN", "BOATS", "WALKER", "KIDSTUFF", "CARDS", "PLATES", + # PEP star RFA status + "PEPSTRFL", + # summary variables of promotion history + "CARDPROM", "MAXADATE", "NUMPROM", "CARDPM12", "NUMPRM12", + # summary variables of giving history + "RAMNTALL", "NGIFTALL", "CARDGIFT", "MINRAMNT", "MAXRAMNT", + "LASTGIFT", "LASTDATE", "FISTDATE", "TIMELAG", "AVGGIFT", + # ID & targets + "CONTROLN", "TARGET_B", "TARGET_D", "TARGET_D2", "HPHONE_D", + # RFA (Recency/Frequency/Donation Amount) + "RFA_2F", "RFA_2A", "MDMAUD_R", "MDMAUD_F", "MDMAUD_A", + #others + "CLUSTER2", "GEOCODE2") > > # created new cup98 set #### > cup98 <- cup98[, varSet] |
Since we have created the new cup98 dataset, we are now ready to create the decision trees. But before we move on, let us explore some of the predictor variables included in the new cup98 data.
Variable Distribution
First, let us start with exploring age and donation amount. We will create a new positive donation dataset from the new cup98 data, and cut() the AGE variable into increments of 5. Then we will plot the donation stratified by the new age intervals to examine how each age grouping donated. Note that we restrict the donation plotted on the y-axis to a maximum of $40.
Donation Age Distribution Code:
1 2 3 4 5 6 7 | # check distribution of donation in various age groups #### > cup98pos <- cup98[cup98$TARGET_D>0,] > age2 <- cut(cup98pos$AGE, right=FALSE, breaks=seq(0, 100, by=5)) table(age2) > boxplot(cup98pos$TARGET_D ~ age2, ylim=c(0,40), las=3, + main="Donation Age Distribution") # people aged 30..60 have higher median donation; in workforce |
Examining the boxplots above, we see that the plots with intervals from 30 to 60 have the highest median donation amount. For practical purposes, the age of donation begins in the 15 to 20 age group, and continues all the way to the 95 to 100 group. A major reason why people aged 30 to 60 donate the higher amounts, can be attributed to their workforce status. People at those ages are most likely to have a job, and therefore has the disposable income to donate.
Next, we can look at the gender of donor with positive donations. We will focus on female, male and joint account donors. This time, with less categories, we can use a density plot to overlay the female, male, and joint plots. Note that we attach() the cup98pos to add it to the R's search path, so we do not have to type the name over and over. Again we restrict the donation amount to the lower donation band, but this time from 0 to 60 to capture the majority of the donations.
Donation-Gender Distribution Code:
1 2 3 4 5 6 7 8 9 10 11 12 | # check distribution of donation for gender #### > attach(cup98pos) ## density plot for male-female-Joint donations > plot(density(TARGET_D[GENDER=="F"]), xlim=c(0,60), col=1, lty=1, main="Donation Density Plot for Gender") > lines(density(TARGET_D[GENDER=="M"]), col=2, lty=2) > lines(density(TARGET_D[GENDER=="J"]), col=3, lty=3) > legend("topright", c("Female", "Male", "Joint account"), col=1:3, lty=1:3) > detach(cup98pos) |
Observe the 3 different colored and patterned lines for the female, male, and joint account densities. All three categories have the similar peak at 10 dollars, and both female and male have similar peaks. However, joint account donations do not have peaks at 5, 15, 20, or 25 amounts. Can you guess that the donation amounts are mainly in multiples of 5 (for donation options in the mail-in order)?
Variable Correlations
We can also check the correlations between the target donation variable and other numeric variables. First create an index indicating whether it is numeric or not. Then we correlate the target donation variable with those in cup98 using the numeric index. We make it easier by taking the absolute value of the correlation, and ordering them in decreasing order with the highest correlation first. This way we will observe the variables with the strongest correlation in the beginning.
Correlation Code:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 | # check correlation between target & numeric vars #### > num.idx <- which(sapply(cup98, is.numeric)) > correlations <- cor(cup98$TARGET_D, cup98[,num.idx], + use="pairwise.complete.obs") > correlations <- abs(correlations) > (correlations <- correlations[,order(correlations, decreasing=TRUE)]) TARGET_D TARGET_B LASTGIFT RAMNTALL AVGGIFT 1.0000000000 0.7742323755 0.0616784458 0.0448101061 0.0442990841 MAXRAMNT INCOME CLUSTER2 NUMPRM12 WEALTH1 0.0392237509 0.0320627023 0.0290870830 0.0251337775 0.0248673117 MINRAMNT LASTDATE NUMPROM CLUSTER CARDPM12 0.0201578686 0.0188471021 0.0173371740 0.0171274879 0.0163577542 NUMCHLD CONTROLN CARDPROM FISTDATE ODATEDW 0.0149204899 0.0133664439 0.0113023931 0.0075324932 0.0069484311 HIT CARDGIFT NGIFTALL MAXADATE TIMELAG 0.0066483728 0.0064498822 0.0048990126 0.0044963520 0.0036115917 DOB HPHONE_D AGE RFA_2F 0.0027541472 0.0024315898 0.0022823598 0.0009047682 |
Of course TARGET_D correlates perfectly with itself, and TARGET_B correlates highly as well, being a yes/no indicator of donation. The next variables with higher correlations are LASTGIFT, RAMNTALL, and AVGGIFT, all three measures of previous giving history. This indicates (with common sense) that previous history is likely to be a strong predictor of the current donation amount.
We can create a scatter plot of the variable HIT, which measures the number of mail-order responses and age. This gives us an idea on how many responses were received from each age. We further stratify the plot by donation amount, changing the shape and color depending on a non-zero donation or not.
Plotting Donation Responses Code:
1 2 3 4 5 6 7 8 9 | ## scatter plot for donation response and age > color <- ifelse(cup98$TARGET_D > 0, "blue", "black") > pch <- ifelse(cup98$TARGET_D > 0, "+", ".") > plot(jitter(cup98$AGE), jitter(cup98$HIT), pch=pch, col=color, cex=0.7, ylim=c(0,70), xlab="AGE", ylab="HIT", main="Mail Order Responses By Age and Donation Status") > legend("topleft", c("TARGET_D>0", "TARGET_D=0"), col=c("blue", "black"), pch=c("+", ".")) |
As we inspect the above plot, we notice the most mail-order responses come from older people, ages 60-80. However, those who donated did so in less responses. Though people would respond to the mail-in order, many which were sent back did not include a donation. There were donations in the responses from people who sent back less responses. It appears as if people want to donate, they would donate in the first few responses.
This post is getting lengthy already, so I will stop going through variables here, and continue the next post with building the decision trees to predict donations. Stay tuned for Part 2 of this Profit Optimization Case Study!
Thanks for reading,
Wayne
@beyondvalence
More:
1. KDD Cup: Profit Optimization in R Part 1: Exploring Data
2. KDD Cup: Profit Optimization in R Part 2: Decision Trees
3. KDD Cup: Profit Optimization in R Part 3: Visualizing Results
4. KDD Cup: Profit Optimization in R Part 4: Selecting Trees
5. KDD Cup: Profit Optimization in R Part 5: Evaluation
Bolavita Agen Sabung Ayam Live Online nomor satu di indonesia !
ReplyDeletedi www.bolavita.site Minimal Deposit hanya 50ribu saja untuk kamu yang ingin ikut Pasang Taruhan langsung secara Online !
Dan Di Bolavita kamu juga bisa menikmati Bonus deposit pertama kamu sebesar 10% !
Informasi Selengkapnya Hubungi Cs kami yang bertugas :
WA : +62812-2222-995
BBM : BOLAVITA
Aplikasi Live Chat Playstore / App Store : BOLAVITA Sabung Ayam