
Thursday, July 10, 2014

Predicting Capital Bikeshare Demand in R: Part 3. Generalized Boosted Model


Hello Readers,

Today in Part 3, we turn to a more robust method to predict bike sharing demand: generalized boosted model regression. Last time, in Part 2, we ran a linear regression to create an initial prediction model and examine the strength of the predictors. To read about the bike sharing data from the Kaggle Knowledge Competition, click here for Part 1.


We also saw how the root mean squared logarithmic error (RMSLE) evaluated predicted "count" values that were lower or higher than the actual "count" value. So here we will explore how we can improve the RMSLE with a generalized boosted regression model of the bike sharing data.
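
As a quick refresher, RMSLE compares the logarithms of the predicted and actual counts, so proportional errors on small counts weigh about as much as proportional errors on large ones. Here is a minimal R helper for the metric (my own sketch, not code from the original post):

# RMSLE: root mean squared logarithmic error
rmsle <- function(actual, predicted) {
  # add 1 before taking logs so zero counts are handled
  sqrt(mean((log(predicted + 1) - log(actual + 1))^2))
}
# example: predicting a count of 150 when the actual count is 100
rmsle(100, 150)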

Let's hop right into R.


Generalized Boosted Regression


Here we use the R library "gbm" to run a generalized boosted regression model on the Kaggle bike sharing data. Remember that we had to modify and transform some variables into the proper formats and factor levels, which was covered in Part 1. Then we pass the training variables and the training target, "count", through the "gbm()" function, along with the other parameters shown below.

gbm Code:

> # load gbm library
> library(gbm)
> # make sure to set working directory
> 
> # load training and test data
> load("train.rdata")
> test <- read.csv("test.csv")
> 
> # gbm -base model ####
> genmod<-gbm(train$count~.
+             ,data=train[,-c(1,9,10,11)] ## registered,casual,count columns
+             ,var.monotone=NULL # which vars go up or down with target
+             ,distribution="gaussian"
+             ,n.trees=1200
+             ,shrinkage=0.05
+             ,interaction.depth=3
+             ,bag.fraction = 0.5
+             ,train.fraction = 1
+             ,n.minobsinnode = 10
+             ,cv.folds = 10
+             ,keep.data=TRUE
+             ,verbose=TRUE)

Iter   TrainDeviance   ValidDeviance   StepSize   Improve
     1    32158.0399             nan     0.0500  644.2250
     2    31578.0942             nan     0.0500  585.4711
     3    31044.4208             nan     0.0500  524.7365
     4    30561.4195             nan     0.0500  478.6728
     5    30154.5583             nan     0.0500  413.1806
     6    29746.5141             nan     0.0500  395.9453
     7    29373.1125             nan     0.0500  359.5060
     8    29014.6034             nan     0.0500  343.1201
     9    28703.5502             nan     0.0500  312.3433
    10    28399.9495             nan     0.0500  282.9091
    20    25870.2205             nan     0.0500  175.8114
    40    23836.1755             nan     0.0500   52.8371
    60    23063.7262             nan     0.0500   26.2731
    80    22693.0915             nan     0.0500    4.3578
   100    22450.1503             nan     0.0500   15.8378
   120    22272.5560             nan     0.0500    2.6863
   140    22122.7034             nan     0.0500    2.3877
   160    21992.0298             nan     0.0500   12.2305
   180    21878.9274             nan     0.0500    3.8828
   200    21790.5052             nan     0.0500    1.1173
   220    21692.7944             nan     0.0500   -0.5421
   240    21619.6496             nan     0.0500   -0.3796
   260    21549.5344             nan     0.0500    0.4666
   280    21490.1193             nan     0.0500    1.0782
   300    21435.0103             nan     0.0500   -3.0478
   320    21355.7273             nan     0.0500    2.5832
   340    21308.1967             nan     0.0500   -2.3362
   360    21265.8199             nan     0.0500   -2.0662
   380    21226.2996             nan     0.0500   -3.2661
   400    21187.4927             nan     0.0500   -2.6117
   420    21145.4325             nan     0.0500   -2.0814
   440    21116.4270             nan     0.0500   -2.1384
   460    21083.5071             nan     0.0500   -0.6912
   480    21047.1991             nan     0.0500   -2.5004
   500    21025.6535             nan     0.0500   -5.4226
   520    20998.4169             nan     0.0500   -1.7574
   540    20970.5290             nan     0.0500   -4.4251
   560    20945.7996             nan     0.0500   -1.5079
   580    20914.3748             nan     0.0500   -2.1149
   600    20885.0842             nan     0.0500   -4.0596
   620    20861.5043             nan     0.0500   -3.0287
   640    20834.8791             nan     0.0500   -3.7761
   660    20802.3307             nan     0.0500   -0.5195
   680    20779.3478             nan     0.0500   -8.1288
   700    20751.2605             nan     0.0500   -1.6472
   720    20725.2298             nan     0.0500   -3.7279
   740    20701.8625             nan     0.0500   -5.0591
   760    20683.7046             nan     0.0500   -2.7938
   780    20658.4463             nan     0.0500    0.7247
   800    20637.1474             nan     0.0500   -3.3238
   820    20617.2582             nan     0.0500   -1.8619
   840    20593.8156             nan     0.0500   -2.6460
   860    20569.5343             nan     0.0500   -2.7741
   880    20548.0619             nan     0.0500   -2.5890
   900    20527.9950             nan     0.0500    0.6394
   920    20509.9647             nan     0.0500   -1.7151
   940    20496.3642             nan     0.0500   -2.2759
   960    20478.2300             nan     0.0500   -2.3908
   980    20461.0466             nan     0.0500   -1.1577
  1000    20440.3127             nan     0.0500   -0.7028
  1020    20422.5978             nan     0.0500   -2.6813
  1040    20409.9711             nan     0.0500   -0.9269
  1060    20389.4390             nan     0.0500   -3.7635
  1080    20376.5523             nan     0.0500   -8.8512
  1100    20364.0977             nan     0.0500   -2.5027
  1120    20352.5354             nan     0.0500   -2.6432
  1140    20343.6042             nan     0.0500   -2.0281
  1160    20330.3500             nan     0.0500   -2.9339
  1180    20315.7748             nan     0.0500   -1.6376
  1200    20307.0301             nan     0.0500   -4.4638
> 

Yes, there are many "gbm" parameters to tweak. One thing to note is the set of variables we used to predict "count": in this model, we removed the "datetime" and "windspeed" variables. I set the number of trees (iterations) to 1,200; the shrinkage, which is the learning rate applied to each expansion, to 0.05; the interaction depth to 3 to allow three-way interactions; the bag fraction to 0.5, so each new tree is fit on a random half of the training set; the minimum number of observations in each terminal node to 10; the number of cross-validation folds to 10; and the verbose parameter to TRUE for the progress print-out.


The progress output appears in the R console as soon as the model starts fitting, printed every iteration at first and then every 20 iterations. It reports the training deviance, the validation deviance (nan here, because with train.fraction = 1 no validation set is held out), the step size (our shrinkage of 0.05), and the improvement from the latest tree.


Next we find the 'best' iteration out of the 1,200 using the "gbm.perf()" function, specifying cross-validation as the "method". It returns the best iteration number along with a graph marking where that iteration falls on the squared error loss curve (Figure 1). We then pass the best iteration to "pretty.gbm.tree()" and print the tree fitted at that iteration, shown below. Additionally, calling "summary()" on the gbm object with the best iteration prints and plots the relative variable influence (Figure 2).


Best Iteration Code:
> # best iteration
> # cv
> best.iter <- gbm.perf(genmod,method="cv") ##the best iteration number
> print(pretty.gbm.tree(genmod, best.iter))
  SplitVar SplitCodePred LeftNode RightNode MissingNode ErrorReduction Weight  Prediction
0        6   81.50000000        1         8           9       75714.60   5443 -0.05284263
1        4   33.21000000        2         6           7       60396.81   4392 -0.14406699
2        4   26.65000000        3         4           5       98849.76   4191 -0.10346147
3       -1   -0.23966561       -1        -1          -1           0.00   3188 -0.23966561
4       -1    0.32945856       -1        -1          -1           0.00   1003  0.32945856
5       -1   -0.10346147       -1        -1          -1           0.00   4191 -0.10346147
6       -1   -0.99072232       -1        -1          -1           0.00    201 -0.99072232
7       -1   -0.14406699       -1        -1          -1           0.00   4392 -0.14406699
8       -1    0.32837278       -1        -1          -1           0.00   1051  0.32837278
9       -1   -0.05284263       -1        -1          -1           0.00   5443 -0.05284263
> 
> summary(genmod, n.trees=best.iter)
                  var    rel.inf
humidity     humidity 34.2059372
atemp           atemp 26.0106434
temp             temp 22.6269229
season         season 10.9486469
workingday workingday  3.5587532
weather       weather  2.1290432
holiday       holiday  0.5200532

Best Iteration Graphs:
Fig. 1: gbm.perf cross-validation plot

We can see the best iteration in Figure 1 as the vertical dotted blue line, and yes it is close to 1,200. In fact, the exact iteration selected was 1,199. Also note the dramatic initial decrease in squared error loss in the first 100 iterations.



Fig. 2: Summary gbm

The plot above summarizes the relative influence of the variables in the gbm model. It indicates that humidity (34.2%), 'feels like' temperature (26.0%), and temperature (22.6%) make up the top three variables.
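
If you just want the influence table without the bar plot, "summary()" can skip the plotting step; a small sketch using the same "genmod" and "best.iter" objects (the plotit argument comes from the gbm package):

# relative influence table only, no bar chart
infl <- summary(genmod, n.trees=best.iter, plotit=FALSE)
# influences are normalized to sum to 100 by default
infl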



Evaluation: RMSLE


Now we have our gbm model and its best iteration. To evaluate the RMSLE of our gbm model, we go one step further and use the "test" data to predict the "count" target and submit the predictions to Kaggle, instead of predicting from a random sample of the training data.

If you have not read in the "test" data yet, do so now, and call "head()" to see which variables it contains. That tells us which columns to drop inside "predict()", since our gbm model does not use "datetime" or "windspeed". So in the "test" set, columns 1 and 9 are taken out. The summary of the predictions shows negative values, which are not valid: we cannot have a negative bike sharing "count" for a given hour (row). We fix this by replacing all negative predictions with zero.
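
Before submitting, it can also help to sanity-check the RMSLE locally on a slice of the training data, using the "rmsle()" helper from the top of the post. This is my own sketch, not part of the original workflow; it reuses the same column subsetting as the "gbm()" call, and since these rows were also used to fit the model, the estimate will be optimistic:

# rough local RMSLE check on 1,000 training rows (optimistic, in-sample)
set.seed(42)
idx <- sample(nrow(train), 1000)
pred.hold <- predict(genmod, train[idx, -c(1,9,10,11)], best.iter)
pred.hold[pred.hold < 0] <- 0   # clamp negatives, as with the test predictions
rmsle(train$count[idx], pred.hold)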


Creating Results Code:

> # predict test data ####
> # use best iteration from cv
> pred.test <- predict(genmod, test[,-c(1,9)], best.iter, type="response")
> summary(pred.test)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 -22.93   98.22  167.70  216.00  315.00  639.10 
> pred.test[pred.test<0] <- 0
> # create output file
> output <- data.frame(datetime=test[,1], count=pred.test)
> write.csv(output, file="results.csv", quote=FALSE, row.names=FALSE)
> 

Then we create a data.frame called "output" (you can name it what you like) with two variables: the "datetime" and the predicted "count". For Kaggle to accept the submission, we write it out as a comma-separated file, or CSV, with "write.csv()". We set the "quote" parameter to FALSE so it does not place quotation marks around factors or strings, and "row.names" to FALSE because the Kaggle submission format only has two columns. I named the file "results.csv".


The "results" CSV file looks like this when viewed in Notepad:


Submission Ready Results 

See the two columns of "datetime" and "count", separated by commas? Just what Kaggle prefers. Now log in to Kaggle and find the submission page on the Dashboard to the left. Submit it, and as you can see below, the submitted gbm model's RMSLE was less than Kaggle's Mean Value Benchmark (1.38 vs 1.58)!



Kaggle Leaderboard

A fantastic result: our gbm model clearly outperformed the linear regression model from Part 2, which had a double-digit RMSLE score. However, our work is not finished yet. The gbm model's parameters can be tweaked, and our variables selected and transformed in different ways, to improve the RMSLE score further.


As you can guess, that is for the next post, where we optimize our gbm model! So stay tuned for more R analysis in Part 4.


Thanks for reading,

Wayne
@beyondvalence
LinkedIn

Capital Bikeshare Series:
1. Predicting Capital Bikeshare Demand in R: Part 1. Data Exploration
2. Predicting Capital Bikeshare Demand in R: Part 2. Regression
3. Predicting Capital Bikeshare Demand in R: Part 3. Generalized Boosted Model
