Valence Analytics: KDD Cup: Profit Optimization in R Part 1: Exploring Data

Hello Readers,

Today we begin a case study on predicting customer response and maximizing donations. The data were obtained from the 1998 KDD Cup, the competition of the Knowledge Discovery and Data Mining (KDD) community. SIGKDD is a Special Interest Group of the Association for Computing Machinery (ACM). Here is a link to the annual KDD conference in 2014 titled "Data Mining for Social Good" in NYC.

We will train decision trees to predict which customers donate the most, and decide who should be sent the mail-in orders. Here we shall use the party library in R. The KDD data can be downloaded here. After you decompress the files, you will need cup98LRN.txt (training set) and cup98VAL.txt (test set).

Part 1

In this first post, we will explore the data, and visualize the distributions of the target variable and its relationship with other variables. In Part 2 I will cover building the decision trees, next in Part 3 I shall visualize the donation results, and in Part 4 I will finish by talking about selecting the best tree to maximize donations.

The 1998 KDD Data

First off, this is a large learning dataset: 95,412 rows and 481 variables, at 117 MB uncompressed. The test set is of similar dimension and size. We therefore need to explore the data, locate the donation amount variable, and decide which variables to use to predict donations. Note that there is a data dictionary detailing all the variables in the information section of the data tab.

To grasp what data we are dealing with, we use the describe() function from the Hmisc library. As you can see, for each variable it reports the number of observations, the number of missing and unique values and, for numeric variables, the mean and quantiles (the .50 column is the median). It also displays the lowest and highest values, or a frequency table when there are only a few distinct values. We pass the first 28 variables to describe(), which covers the demographic data. The output is quite lengthy, so the calls for the other groups of variables are commented out; uncomment them if you want to run them.

Describing Demographics Code:

# learning dataset
# 95,412 records, 481 fields
# 2 target variables
# load learning data ###
> cup98 <- read.csv("cup98LRN.txt", sep=",")

## 1. detailed description of the variables ####
# target variables
## TARGET_B (y-n donation) and TARGET_D (donation amount)
> library(Hmisc)
# describe(cup98[,1:28]) # demographics
# describe(cup98[,29:42]) # response to other types of mail orders
# describe(cup98[,43:55]) # overlay data
# describe(cup98[,56:74]) # donor interests
# describe(cup98[,75]) # PEP star RFA status
# describe(cup98[,76:361]) # characteristics of donor neighborhood
# describe(cup98[,362:407]) # promotion history
# describe(cup98[,408:412]) # summary variables of promotion history
# describe(cup98[,413:456]) # giving history
# describe(cup98[,457:469]) # summary variables of giving history
# describe(cup98[,470:473]) ## ID & TARGETS
# describe(cup98[,474:479]) # RFA (recency-frequency-donation amount)
# describe(cup98[,480:481]) # cluster & geocode
# names(cup98)

> describe(cup98[,1:28]) # demographics
cup98[, 1:28] 

 28  Variables      95412  Observations
-----------------------------------------------------------------------------------------------
ODATEDW 
      n missing  unique    Mean     .05     .10     .25     .50     .75     .90     .95 
  95412       0      54    9141    8601    8601    8801    9201    9501    9601    9601 

lowest : 8306 8401 8501 8601 8604, highest: 9510 9511 9512 9601 9701 
-----------------------------------------------------------------------------------------------
OSOURCE 
      n missing  unique 
  95412       0     896 

lowest :     AAA AAD AAM ABC, highest: YAN YKA YKD YNF ZOY 
-----------------------------------------------------------------------------------------------
TCODE 
      n missing  unique    Mean     .05     .10     .25     .50     .75     .90     .95 
  95412       0      55   54.22       0       0       0       1       2      28      28 

lowest :     0     1     2     3     4, highest: 24002 28028 39002 58002 72002 
-----------------------------------------------------------------------------------------------
STATE 
      n missing  unique 
  95412       0      57 

lowest : AA AE AK AL AP, highest: VT WA WI WV WY 
-----------------------------------------------------------------------------------------------
ZIP 
      n missing  unique 
  95412       0   19938 

lowest : 00801  00802  00820  00821  00840 , highest: 99901- 99925  99928  99928- 99950  
-----------------------------------------------------------------------------------------------
MAILCODE 
      n missing  unique 
  95412       0       2 

  (94013, 99%), B (1399, 1%) 
-----------------------------------------------------------------------------------------------
PVASTATE 
      n missing  unique 
  95412       0       3 

  (93954, 98%), E (5, 0%), P (1453, 2%) 
-----------------------------------------------------------------------------------------------
DOB 
      n missing  unique    Mean     .05     .10     .25     .50     .75     .90     .95 
  95412       0     947    2724       0       0     201    2610    4601    5605    6201 

lowest :    0    1    2    4    5, highest: 9701 9704 9706 9708 9710 
-----------------------------------------------------------------------------------------------
NOEXCH 
      n missing  unique 
  95412       0       4 

  (7, 0%), 0 (95085, 100%), 1 (285, 0%), X (35, 0%) 
-----------------------------------------------------------------------------------------------
RECINHSE 
      n missing  unique 
  95412       0       2 

  (88709, 93%), X (6703, 7%) 
-----------------------------------------------------------------------------------------------
RECP3 
      n missing  unique 
  95412       0       2 

  (93395, 98%), X (2017, 2%) 
-----------------------------------------------------------------------------------------------
RECPGVG 
      n missing  unique 
  95412       0       2 

  (95298, 100%), X (114, 0%) 
-----------------------------------------------------------------------------------------------
RECSWEEP 
      n missing  unique 
  95412       0       2 

  (93795, 98%), X (1617, 2%) 
-----------------------------------------------------------------------------------------------
MDMAUD 
      n missing  unique 
  95412       0      28 

lowest : C1CM C1LM C1MM C2CM C2LM, highest: L1MM L2CM L2LM L2TM XXXX 
-----------------------------------------------------------------------------------------------
DOMAIN 
      n missing  unique 
  95412       0      17 

                 C1   C2   C3   R1    R2   R3    S1   S2   S3   T1    T2   T3   U1   U2   U3
Frequency 2316 6145 8264 5280 1358 13623 4809 11503 8530 1891 4982 12369 2176 4510 3254 2598
%            2    6    9    6    1    14    5    12    9    2    5    13    2    5    3    3
            U4
Frequency 1804
%            2
-----------------------------------------------------------------------------------------------
CLUSTER 
      n missing  unique    Mean     .05     .10     .25     .50     .75     .90     .95 
  93096    2316      53   27.92       4       8      15      28      40      47      50 

lowest :  1  2  3  4  5, highest: 49 50 51 52 53 
-----------------------------------------------------------------------------------------------
AGE 
      n missing  unique    Mean     .05     .10     .25     .50     .75     .90     .95 
  71747   23665      96   61.61    34.3    39.0    48.0    62.0    75.0    83.0    87.0 

lowest :  1  2  3  4  6, highest: 94 95 96 97 98 
-----------------------------------------------------------------------------------------------
AGEFLAG 
      n missing  unique 
  95412       0       3 

  (29548, 31%), E (57344, 60%), I (8520, 9%) 
-----------------------------------------------------------------------------------------------
HOMEOWNR 
      n missing  unique 
  95412       0       3 

  (22228, 23%), H (52354, 55%), U (20830, 22%) 
-----------------------------------------------------------------------------------------------
CHILD03 
      n missing  unique 
  95412       0       4 

  (94266, 99%), B (40, 0%), F (237, 0%), M (869, 1%) 
-----------------------------------------------------------------------------------------------
CHILD07 
      n missing  unique 
  95412       0       4 

  (93846, 98%), B (97, 0%), F (408, 0%), M (1061, 1%) 
-----------------------------------------------------------------------------------------------
CHILD12 
      n missing  unique 
  95412       0       4 

  (93601, 98%), B (142, 0%), F (520, 1%), M (1149, 1%) 
-----------------------------------------------------------------------------------------------
CHILD18 
      n missing  unique 
  95412       0       4 

  (92565, 97%), B (263, 0%), F (1142, 1%), M (1442, 2%) 
-----------------------------------------------------------------------------------------------
NUMCHLD 
      n missing  unique    Mean 
  12386   83026       7   1.528 

             1    2    3   4  5 6 7
Frequency 7792 3110 1101 316 59 7 1
%           63   25    9   3  0 0 0
-----------------------------------------------------------------------------------------------
INCOME 
      n missing  unique    Mean 
  74126   21286       7   3.886 

             1     2    3     4     5    6    7
Frequency 9022 13114 8558 12732 15451 7778 7471
%           12    18   12    17    21   10   10
-----------------------------------------------------------------------------------------------
GENDER 
      n missing  unique 
  95412       0       7 

               A C     F   J     M    U
Frequency 2957 2 2 51277 365 39094 1715
%            3 0 0    54   0    41    2
-----------------------------------------------------------------------------------------------
WEALTH1 
      n missing  unique    Mean     .05     .10     .25     .50     .75     .90     .95 
  50680   44732      10   5.346       1       1       3       6       8       9       9 

             0    1    2    3    4    5    6    7    8    9
Frequency 2413 3454 4085 4237 4810 5280 5825 6198 6793 7585
%            5    7    8    8    9   10   11   12   13   15
-----------------------------------------------------------------------------------------------
HIT 
      n missing  unique    Mean     .05     .10     .25     .50     .75     .90     .95 
  95412       0      75   3.321       0       0       0       0       3      11      17 

lowest :   0   1   2   3   4, highest:  75  79  84 240 241 
-----------------------------------------------------------------------------------------------
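
If Hmisc is not at hand, the core numbers that describe() reports can be approximated in base R. The following is only a rough sketch on a made-up data frame; quick_describe() is a hypothetical helper name, not part of Hmisc, and the toy values are not taken from cup98. Note that blank strings are counted as a value rather than as missing, which matches what we see above for fields such as MAILCODE.

```r
# Toy data frame standing in for a few cup98 columns (values are made up).
toy <- data.frame(
  AGE    = c(34, 62, NA, 75, 48, NA, 87, 62),
  INCOME = c(1, 5, 4, NA, 2, 7, 5, 5),
  GENDER = c("F", "M", "F", "", "J", "F", "M", "F"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: observation count, missing count and distinct values,
# plus mean and quartiles for numeric columns only.
quick_describe <- function(x) {
  out <- list(
    n       = sum(!is.na(x)),
    missing = sum(is.na(x)),
    unique  = length(unique(x[!is.na(x)]))
  )
  if (is.numeric(x)) {
    out$mean      <- mean(x, na.rm = TRUE)
    out$quartiles <- quantile(x, probs = c(0.25, 0.5, 0.75), na.rm = TRUE)
  }
  out
}

summaries <- lapply(toy, quick_describe)
str(summaries)
```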

Our focus on the targets leads us to variables 470 to 473. They are CONTROLN, the unique ID number; TARGET_B, a binary donation indicator; TARGET_D, the donation amount in USD; and HPHONE_D, a binary home phone number indicator.

Target: Donations

Let us take a look at the distribution of the donation variable, TARGET_D. From the describe() function we see that most records have no donation at all: every quantile up to the 90th is 0, and the 95th percentile is only 3 dollars. However, in the extreme values, we see 100+ dollar donations, peaking at $200.

Describing Donations Code:

> describe(cup98[,472])
cup98[, 472] 
      n missing  unique    Mean     .05     .10     .25     .50     .75 
  95412       0      71  0.7931       0       0       0       0       0 
    .90     .95 
      0       3 

lowest :   0.0   1.0   2.0   2.5   3.0
highest: 100.0 101.0 102.0 150.0 200.0
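
Because most rows are zero, the overall mean mixes two things: how often people donate, and how much they give when they do. A small sketch on made-up numbers (the vector target_d_toy is an assumption for illustration, not cup98 data) shows how the overall mean splits into a response rate times the mean positive donation.

```r
# Made-up donation amounts: 8 non-donors and 2 donors.
target_d_toy <- c(0, 0, 0, 0, 0, 0, 0, 0, 10, 20)

response_rate <- mean(target_d_toy > 0)                # share of rows with a donation
mean_all      <- mean(target_d_toy)                    # mean over everyone
mean_pos      <- mean(target_d_toy[target_d_toy > 0])  # mean among donors only

# The overall mean equals the response rate times the mean positive donation.
all.equal(mean_all, response_rate * mean_pos)
```

This is why it helps to look at the positive donations separately, rather than at the overall mean alone.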

We can isolate the positive donations in cup98pos by using TARGET_B > 0, and visualize the modified donation distribution using a box plot. This way we only deal with those rows that donated some amount.

Plotting Donations Code:

> # positive donations- boxplot, description ####
> cup98pos <- cup98[cup98$TARGET_B > 0,]
> donations <- cup98pos$TARGET_D
> summary(donations)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00   10.00   13.00   15.62   20.00  200.00 
> boxplot(donations, main="Positive Donations")

The box, which spans the 25th to 75th percentiles and so holds the middle 50% of the values, runs from 10 to 20 dollars, and 75% of the values lie at or below 20 dollars. So we can conclude that those who donated usually gave $20 or less. However, there were a few generous donations of $100-$200. The boxplot offers a fast way to view distributions. Another option is to cut() the donations into intervals and see how many rows fall into each interval.
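
To read the numbers behind a boxplot without drawing it, base R offers quantile(), IQR() and boxplot.stats(). Below is a minimal sketch on made-up amounts (donations_toy is an assumption, not the cup98 values); it shows where the box edges sit and which points would be drawn as outliers.

```r
# Made-up positive donation amounts, not taken from cup98.
donations_toy <- c(5, 10, 10, 12, 13, 15, 20, 20, 25, 100)

# Quartiles: the box spans the 25th to the 75th percentile.
q   <- quantile(donations_toy, probs = c(0.25, 0.5, 0.75))
iqr <- IQR(donations_toy)

# boxplot.stats() returns the five numbers used by boxplot() in $stats and
# the points beyond 1.5 times the box length (the default) in $out.
bs <- boxplot.stats(donations_toy)

print(q)
print(bs$stats)
print(bs$out)
```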

We split the donations into intervals: the lowest, from 0 to 0.1, captures the non-donors; the next runs from 0.1 to 10; the following ones step by 5 up to 30; then come 30-50 and the highest, 50-200. Observing the table of the donation intervals, we see the majority (n=90569) did not donate anything, while 110 donations fell into the highest category, 50-200 dollars. We plot both the donation intervals and the positive donation intervals to examine the distribution more closely.

Cut and Plot Code:

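# create level factor donation variable ####
# [a,b), closed on left, open on right: (a <= x < b)
# note: with right=FALSE a value equal to the last break (the maximum) falls
# outside the top interval and becomes NA unless include.lowest=TRUE is added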
> cup98$TARGET_D2 <- cut(cup98$TARGET_D, right=FALSE,
                       breaks=c(0,0.1,10,15,20,25,30,50,max(cup98$TARGET_D)))
> cup98pos$TARGET_D2 <- cut(cup98pos$TARGET_D, right=FALSE,
                          breaks=c(0,0.1,10,15,20,25,30,50,max(cup98pos$TARGET_D)))
> layout(matrix(c(1,2),2,1))
> plot(table(cup98$TARGET_D2), main="All Donations", 
     xlab="Dollar Amount", ylab="Number of Donations")
> plot(table(cup98pos$TARGET_D2), main="Positive Donations", 
     xlab="Dollar Amount", ylab="Number of Donations")
> layout(1)
> table(cup98$TARGET_D2)

 [0,0.1) [0.1,10)  [10,15)  [15,20)  [20,25)  [25,30)  [30,50) [50,200) 
   90569     1132     1378      806      745      435      233      110
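
One detail of cut() is worth checking here. With right=FALSE every interval is open on the right, so a value exactly equal to the last break is not covered by the top interval and comes back as NA. The counts printed above add up to 95,408, four fewer than the 95,412 records, which would be consistent with a few maximum-value donations being dropped this way. The sketch below uses made-up amounts (not cup98 data) to show the behaviour and one possible remedy, include.lowest=TRUE.

```r
# Made-up amounts; 200 plays the role of the largest donation.
amounts <- c(0, 5, 12, 50, 200)
brks    <- c(0, 0.1, 10, 15, 50, max(amounts))

# Left-closed intervals only: the top interval is [50,200), so 200 becomes NA.
open_top <- cut(amounts, breaks = brks, right = FALSE)

# include.lowest = TRUE also closes the outermost interval, giving [50,200].
closed_top <- cut(amounts, breaks = brks, right = FALSE, include.lowest = TRUE)

sum(is.na(open_top))    # number of values dropped without include.lowest
sum(is.na(closed_top))  # number of values dropped with include.lowest
levels(closed_top)
```

If the remedy is applied to the real data, the top label and its count will differ slightly from the table shown above.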

The lower plot contains only those who donated. Without the large bar of non-donors the y-axis limit is much smaller, so the bars for the donation intervals are easier to see. The most frequent donation intervals are [0.1,10) and [10,15). Remember that a square bracket [ ] marks a closed end, which includes the number, and a parenthesis ( ) marks an open end, which does not. So the interval [10,15) covers 10 up to, but not including, 15.

Additionally, we can plot the counts returned by the table() function, which gives the number of donations for each unique donation value. This way we can visualize the number of donations for each dollar amount, instead of relying on intervals, as we did above.

Plot Code:

# discrete donation plot
> plot(table(cup98pos$TARGET_D), type="l", ylim=c(0,1000),
+ xlab="Donation Amount", ylab="Number of Donations",
+ main="Donation Distribution")

The tick marks on the x-axis denote actual donated amounts, and the y-axis shows the number of donations at each amount. Observe that the majority of the donations are below $27. The highest 4 peaks are above 400, and the 5th comes in close at 392, which is the $25 amount. The distribution is heavily skewed to the right: most values are small, with a long tail stretching out to a few very large donations.
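
Rather than reading the peaks off the plot, we can sort the table to list the most common amounts directly. This is a small sketch on made-up donations (the vector donated_toy is an assumption, not cup98 data); comparing the mean with the median is a quick check for right skew.

```r
# Made-up positive donation amounts.
donated_toy <- c(5, 10, 10, 10, 15, 15, 20, 20, 20, 20, 25, 100)

# Count each distinct amount, then keep the three most frequent ones.
counts <- table(donated_toy)
top3   <- sort(counts, decreasing = TRUE)[1:3]
print(top3)

# In a right-skewed distribution the mean is pulled above the median.
mean(donated_toy) > median(donated_toy)
```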

Variable Selection

Because there are a large number of variables from which to select, going through each might be a hassle. Luckily, we have a data dictionary describing each variable. We will keep 67 variables: 66 of the 481 available in the cup98 data, plus TARGET_D2, the interval variable we created above. They include demographic variables, donor interests, promotion history, giving history, ID & TARGETs, and recency-frequency-donation amounts. Of course you can add other variables you think will be vital in predicting donation amounts.

Variable Selection Code:

> # select variables ####
> varSet <- c(
+   # demographics
+   "ODATEDW", "OSOURCE", "STATE", "ZIP", "PVASTATE", "DOB",
+   "RECINHSE", "MDMAUD", "DOMAIN", "CLUSTER", "AGE", "HOMEOWNR",
+   "CHILD03", "CHILD07", "CHILD12", "CHILD18", "NUMCHLD",
+   "INCOME", "GENDER", "WEALTH1", "HIT",
+   # donor interests
+   "COLLECT1", "VETERANS", "BIBLE", "CATLG", "HOMEE", "PETS",
+   "CDPLAY", "STEREO", "PCOWNERS", "PHOTO", "CRAFTS", "FISHER",
+   "GARDENIN", "BOATS", "WALKER", "KIDSTUFF", "CARDS", "PLATES",
+   # PEP star RFA status
+   "PEPSTRFL",
+   # summary variables of promotion history
+   "CARDPROM", "MAXADATE", "NUMPROM", "CARDPM12", "NUMPRM12",
+   # summary variables of giving history
+   "RAMNTALL", "NGIFTALL", "CARDGIFT", "MINRAMNT", "MAXRAMNT",
+   "LASTGIFT", "LASTDATE", "FISTDATE", "TIMELAG", "AVGGIFT",
+   # ID & targets
+   "CONTROLN", "TARGET_B", "TARGET_D", "TARGET_D2", "HPHONE_D",
+   # RFA (Recency/Frequency/Donation Amount)
+   "RFA_2F", "RFA_2A", "MDMAUD_R", "MDMAUD_F", "MDMAUD_A",
+   #others
+   "CLUSTER2", "GEOCODE2")
> 
> # created new cup98 set ####
> cup98 <- cup98[, varSet]
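
Subsetting by a vector of names stops with an "undefined columns selected" error if any name is absent, for example if TARGET_D2 has not been created yet. A quick check with setdiff() before subsetting can save some head scratching. The sketch below uses a made-up data frame and a short, hypothetical wanted vector in place of cup98 and varSet.

```r
# Made-up stand-ins for cup98 and varSet.
toy    <- data.frame(AGE = c(30, 60), GENDER = c("F", "M"), TARGET_D = c(0, 10))
wanted <- c("AGE", "GENDER", "TARGET_D", "TARGET_D2")

# Names requested but not present in the data frame.
missing_vars <- setdiff(wanted, names(toy))
print(missing_vars)

# Keep only the requested columns that actually exist.
toy_sel <- toy[, intersect(wanted, names(toy)), drop = FALSE]
names(toy_sel)
```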

Since we have created the new cup98 dataset, we are now ready to create the decision trees. But before we move on, let us explore some of the predictor variables included in the new cup98 data.

Variable Distribution

First, let us start with exploring age and donation amount. We will create a new positive donation dataset from the new cup98 data, and cut() the AGE variable into increments of 5. Then we will plot the donation stratified by the new age intervals to examine how each age grouping donated. Note that we restrict the donation plotted on the y-axis to a maximum of $40.

Donation Age Distribution Code:

# check distribution of donation in various age groups ####
> cup98pos <- cup98[cup98$TARGET_D>0,]
> age2 <- cut(cup98pos$AGE, right=FALSE, breaks=seq(0, 100, by=5))
> table(age2)
> boxplot(cup98pos$TARGET_D ~ age2, ylim=c(0,40), las=3,
+ main="Donation Age Distribution")
# people aged 30..60 have higher median donation; in workforce

Examining the boxplots above, we see that the age intervals from 30 to 60 have the highest median donation amounts. For practical purposes, donations begin in the 15 to 20 age group and continue all the way to the 95 to 100 group. A plausible reason why people aged 30 to 60 donate higher amounts is their workforce status: people at those ages are most likely to have a job, and therefore have the disposable income to donate.
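
The medians seen in the boxplots can also be computed as numbers with tapply(). Here is a minimal sketch on made-up ages and amounts (age_toy and amt_toy are assumptions, not cup98 data). Rows with a missing age drop out of the grouping, and age groups with no rows come back as NA, so it is worth looking at the group sizes alongside the medians.

```r
# Made-up ages and positive donation amounts; one age is missing.
age_toy <- c(22, 34, 37, 45, 52, 58, 63, 71, 88, NA)
amt_toy <- c(5, 20, 15, 25, 20, 15, 10, 10, 5, 10)

# Same 5-year, left-closed age intervals as in the post.
age_grp <- cut(age_toy, right = FALSE, breaks = seq(0, 100, by = 5))

# Median donation and number of donors per age group.
med_by_age <- tapply(amt_toy, age_grp, median)
n_by_age   <- table(age_grp)

# Show only the groups that contain at least one donor.
print(med_by_age[!is.na(med_by_age)])
print(n_by_age[n_by_age > 0])
```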

Next, we can look at the gender of donors with positive donations. We will focus on female, male and joint account donors. This time, with fewer categories, we can use a density plot and overlay the female, male, and joint curves. Note that we attach() cup98pos to add it to R's search path, so we do not have to type the name over and over, and detach() it when we are done. Again we restrict the donation amount to the lower donation band, but this time from 0 to 60 to capture the majority of the donations.

Donation-Gender Distribution Code:

# check distribution of donation for gender ####
> attach(cup98pos) 

## density plot for male-female-Joint donations
> plot(density(TARGET_D[GENDER=="F"]), xlim=c(0,60), col=1, lty=1,
     main="Donation Density Plot for Gender")
> lines(density(TARGET_D[GENDER=="M"]), col=2, lty=2)
> lines(density(TARGET_D[GENDER=="J"]), col=3, lty=3)
> legend("topright", c("Female", "Male", "Joint account"), 
       col=1:3, lty=1:3)

> detach(cup98pos)

Observe the 3 different colored and patterned lines for the female, male, and joint account densities. All three categories peak at about 10 dollars, and the female and male curves are very similar, with further peaks at 5, 15, 20, and 25 dollars. The joint account curve does not show those additional peaks. Can you guess why the donation amounts cluster at multiples of 5? The likely explanation is the donation options offered in the mail-in order.
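
The multiples-of-5 guess is easy to test with the modulo operator. The sketch below uses a made-up data frame (pos_toy is an assumption, not cup98pos), and it uses with() instead of attach(), which keeps the column names short without changing the search path.

```r
# Made-up positive donations with a gender code per donor.
pos_toy <- data.frame(
  TARGET_D = c(5, 10, 10, 15, 20, 25, 12, 10, 50, 7),
  GENDER   = c("F", "F", "M", "M", "F", "J", "F", "J", "M", "F")
)

# Share of donations that are exact multiples of 5 dollars.
share_mult5 <- mean(pos_toy$TARGET_D %% 5 == 0)

# The same share computed within each gender code.
share_by_gender <- with(pos_toy, tapply(TARGET_D %% 5 == 0, GENDER, mean))

print(share_mult5)
print(share_by_gender)
```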

Variable Correlations

We can also check the correlations between the target donation variable and the other numeric variables. First we create an index of which columns are numeric. Then we correlate the target donation variable with those columns of cup98. To make the result easier to read, we take the absolute value of each correlation and sort in decreasing order, so the variables with the strongest correlation appear first.

Correlation Code:

# check correlation between target & numeric vars ####
> num.idx <- which(sapply(cup98, is.numeric))
> correlations <- cor(cup98$TARGET_D, cup98[,num.idx],
+ use="pairwise.complete.obs")
> correlations <- abs(correlations)
> (correlations <- correlations[,order(correlations, decreasing=TRUE)])
    TARGET_D     TARGET_B     LASTGIFT     RAMNTALL      AVGGIFT 
1.0000000000 0.7742323755 0.0616784458 0.0448101061 0.0442990841 
    MAXRAMNT       INCOME     CLUSTER2     NUMPRM12      WEALTH1 
0.0392237509 0.0320627023 0.0290870830 0.0251337775 0.0248673117 
    MINRAMNT     LASTDATE      NUMPROM      CLUSTER     CARDPM12 
0.0201578686 0.0188471021 0.0173371740 0.0171274879 0.0163577542 
     NUMCHLD     CONTROLN     CARDPROM     FISTDATE      ODATEDW 
0.0149204899 0.0133664439 0.0113023931 0.0075324932 0.0069484311 
         HIT     CARDGIFT     NGIFTALL     MAXADATE      TIMELAG 
0.0066483728 0.0064498822 0.0048990126 0.0044963520 0.0036115917 
         DOB     HPHONE_D          AGE       RFA_2F 
0.0027541472 0.0024315898 0.0022823598 0.0009047682

Of course TARGET_D correlates perfectly with itself, and TARGET_B correlates highly as well, being a yes/no indicator of donation. The next highest correlations belong to LASTGIFT, RAMNTALL, and AVGGIFT, all three measures of previous giving history. These correlations are small in absolute terms (all below 0.07), but their ranking agrees with common sense: previous giving history is likely to be among the more useful predictors of the current donation amount.
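
Since the target itself, TARGET_B, and the ID CONTROLN cannot serve as predictors, one option is to leave such columns out before ranking. The following is only a sketch on simulated data: the data frame, the simulated relationship, and the choice of columns to exclude are all assumptions for illustration, not results from cup98.

```r
set.seed(1)
n <- 200

# Simulated stand-in data: one informative numeric column, one noise column
# with some missing values, and one character column.
toy <- data.frame(
  LASTGIFT = rexp(n, rate = 1 / 15),
  AGE      = sample(c(20:90, NA), n, replace = TRUE),
  STATE    = sample(c("CA", "NY"), n, replace = TRUE)
)
toy$TARGET_D <- 0.5 * toy$LASTGIFT + rnorm(n, sd = 5)

# Numeric columns only, with the target itself excluded.
num_idx    <- sapply(toy, is.numeric)
predictors <- setdiff(names(toy)[num_idx], "TARGET_D")

# Correlate the target with each predictor, using all complete pairs.
r <- cor(toy$TARGET_D, toy[, predictors], use = "pairwise.complete.obs")

# Rank by absolute correlation, strongest first.
ranked <- sort(abs(r[1, ]), decreasing = TRUE)
print(ranked)
```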

We can create a scatter plot of age against the variable HIT, which measures the number of mail-order responses. This gives us an idea of how many responses were received at each age. We further stratify the plot by donation status, changing the shape and color of each point depending on whether the donation is non-zero.

Plotting Donation Responses Code:

## scatter plot for donation response and age
> color <- ifelse(cup98$TARGET_D > 0, "blue", "black")
> pch <- ifelse(cup98$TARGET_D > 0, "+", ".")
> plot(jitter(cup98$AGE), jitter(cup98$HIT), pch=pch, col=color,
     cex=0.7, ylim=c(0,70), xlab="AGE", ylab="HIT",
     main="Mail Order Responses By Age and Donation Status")
> legend("topleft", c("TARGET_D>0", "TARGET_D=0"), 
       col=c("blue", "black"),
       pch=c("+", "."))

As we inspect the above plot, we notice that the most mail-order responses come from older people, ages 60-80. However, those who donated tended to have fewer responses. Many of the mail-in orders that were sent back did not include a donation, and the donations came mostly from people who had sent back fewer responses. It appears that if people want to donate, they do so within their first few responses.
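
A scatter plot with this many points can be hard to judge by eye, so a numeric summary by donation status is a useful companion. The sketch below compares the median HIT of donors and non-donors on made-up values (hit_toy is an assumption, not cup98 data).

```r
# Made-up HIT counts and donation amounts.
hit_toy <- data.frame(
  HIT      = c(4, 2, 15, 1, 6, 30, 3, 0),
  TARGET_D = c(0, 10, 0, 5, 0, 0, 20, 0)
)

# Flag donors, then take the median HIT within each group.
hit_toy$donor <- hit_toy$TARGET_D > 0
median_hit    <- tapply(hit_toy$HIT, hit_toy$donor, median)

# The result is named "FALSE" (non-donors) and "TRUE" (donors).
print(median_hit)
```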

This post is getting lengthy already, so I will stop going through variables here, and continue the next post with building the decision trees to predict donations. Stay tuned for Part 2 of this Profit Optimization Case Study!

Thanks for reading,

Wayne
@beyondvalence
LinkedIn

More:
1. KDD Cup: Profit Optimization in R Part 1 : Exploring Data
2. KDD Cup: Profit Optimization in R Part 2 : Decision Trees
3. KDD Cup: Profit Optimization in R Part 3: Visualizing Results
4. KDD Cup: Profit Optimization in R Part 4: Selecting Trees
5. KDD Cup: Profit Optimization in R Part 5: Evaluation


Saturday, May 24, 2014
