Hello Readers,
Today in Part 3, we turn to a more robust method for predicting bike sharing demand: generalized boosted model (gbm) regression. Last time, in Part 2, we ran a linear regression to create an initial prediction model and to examine the strength of the predictors. To read about the bike sharing data from the Kaggle Knowledge Competition, click here for Part 1.
We also saw how the root mean squared logarithmic error (RMSLE) evaluated predicted "count" values that were lower or higher than the actual "count" value. So here we will explore how we can improve the RMSLE with a generalized boosted regression model of the bike sharing data.
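As a refresher, the metric is RMSLE = sqrt(mean((log(pred + 1) - log(actual + 1))^2)). A small helper function makes it easy to compute locally; this is just a sketch, and the name "rmsle" is my own choice rather than anything from the Kaggle site.

# RMSLE helper (a sketch; the function name is my own)
rmsle <- function(predicted, actual) {
  sqrt(mean((log1p(predicted) - log1p(actual))^2))
}

# example: an under-prediction and an over-prediction of the same actual count
rmsle(c(50, 200), c(100, 100))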
Let's hop right into R.
Generalized Boosted Regression
Here we utilize GBM with the R library, "gbm", to run the models on the Kaggle bike sharing data. Remember that we had to modify and transform some variables into the proper format and factor levels, which was covered in Part 1. Then we pass the training variables and the training target, "count", through the "gbm()" function, along with other parameters, shown below.
gbm Code:
> # load gbm library
> library(gbm)
> # make sure to set working directory
>
> # load training and test data
> load("train.rdata")
> test <- read.csv("test.csv")
>
> # gbm - base model ####
> genmod <- gbm(train$count~.
+              ,data=train[,-c(1,9,10,11)] ## registered,casual,count columns
+              ,var.monotone=NULL # which vars go up or down with target
+              ,distribution="gaussian"
+              ,n.trees=1200
+              ,shrinkage=0.05
+              ,interaction.depth=3
+              ,bag.fraction=0.5
+              ,train.fraction=1
+              ,n.minobsinnode=10
+              ,cv.folds=10
+              ,keep.data=TRUE
+              ,verbose=TRUE)
Iter   TrainDeviance   ValidDeviance   StepSize   Improve
   1     32158.0399             nan     0.0500   644.2250
   2     31578.0942             nan     0.0500   585.4711
   3     31044.4208             nan     0.0500   524.7365
   4     30561.4195             nan     0.0500   478.6728
   5     30154.5583             nan     0.0500   413.1806
   6     29746.5141             nan     0.0500   395.9453
   7     29373.1125             nan     0.0500   359.5060
   8     29014.6034             nan     0.0500   343.1201
   9     28703.5502             nan     0.0500   312.3433
  10     28399.9495             nan     0.0500   282.9091
  20     25870.2205             nan     0.0500   175.8114
  40     23836.1755             nan     0.0500    52.8371
  60     23063.7262             nan     0.0500    26.2731
  80     22693.0915             nan     0.0500     4.3578
 100     22450.1503             nan     0.0500    15.8378
 120     22272.5560             nan     0.0500     2.6863
 140     22122.7034             nan     0.0500     2.3877
 160     21992.0298             nan     0.0500    12.2305
 180     21878.9274             nan     0.0500     3.8828
 200     21790.5052             nan     0.0500     1.1173
 220     21692.7944             nan     0.0500    -0.5421
 240     21619.6496             nan     0.0500    -0.3796
 260     21549.5344             nan     0.0500     0.4666
 280     21490.1193             nan     0.0500     1.0782
 300     21435.0103             nan     0.0500    -3.0478
 320     21355.7273             nan     0.0500     2.5832
 340     21308.1967             nan     0.0500    -2.3362
 360     21265.8199             nan     0.0500    -2.0662
 380     21226.2996             nan     0.0500    -3.2661
 400     21187.4927             nan     0.0500    -2.6117
 420     21145.4325             nan     0.0500    -2.0814
 440     21116.4270             nan     0.0500    -2.1384
 460     21083.5071             nan     0.0500    -0.6912
 480     21047.1991             nan     0.0500    -2.5004
 500     21025.6535             nan     0.0500    -5.4226
 520     20998.4169             nan     0.0500    -1.7574
 540     20970.5290             nan     0.0500    -4.4251
 560     20945.7996             nan     0.0500    -1.5079
 580     20914.3748             nan     0.0500    -2.1149
 600     20885.0842             nan     0.0500    -4.0596
 620     20861.5043             nan     0.0500    -3.0287
 640     20834.8791             nan     0.0500    -3.7761
 660     20802.3307             nan     0.0500    -0.5195
 680     20779.3478             nan     0.0500    -8.1288
 700     20751.2605             nan     0.0500    -1.6472
 720     20725.2298             nan     0.0500    -3.7279
 740     20701.8625             nan     0.0500    -5.0591
 760     20683.7046             nan     0.0500    -2.7938
 780     20658.4463             nan     0.0500     0.7247
 800     20637.1474             nan     0.0500    -3.3238
 820     20617.2582             nan     0.0500    -1.8619
 840     20593.8156             nan     0.0500    -2.6460
 860     20569.5343             nan     0.0500    -2.7741
 880     20548.0619             nan     0.0500    -2.5890
 900     20527.9950             nan     0.0500     0.6394
 920     20509.9647             nan     0.0500    -1.7151
 940     20496.3642             nan     0.0500    -2.2759
 960     20478.2300             nan     0.0500    -2.3908
 980     20461.0466             nan     0.0500    -1.1577
1000     20440.3127             nan     0.0500    -0.7028
1020     20422.5978             nan     0.0500    -2.6813
1040     20409.9711             nan     0.0500    -0.9269
1060     20389.4390             nan     0.0500    -3.7635
1080     20376.5523             nan     0.0500    -8.8512
1100     20364.0977             nan     0.0500    -2.5027
1120     20352.5354             nan     0.0500    -2.6432
1140     20343.6042             nan     0.0500    -2.0281
1160     20330.3500             nan     0.0500    -2.9339
1180     20315.7748             nan     0.0500    -1.6376
1200     20307.0301             nan     0.0500    -4.4638
>
Yes, there are many "gbm" parameters to tweak. One thing to note is which variables we used to predict "count": in this model, we removed the "datetime" and "windspeed" variables. I set the number of trees (iterations) to 1,200; the shrinkage, which is the learning rate applied at each expansion, to 0.05; the interaction depth to 3 to allow three-way interactions; the bag fraction to half of the training set for each new tree; the minimum number of observations per terminal node to 10; the number of cross-validation folds to 10; and the verbose parameter to TRUE for the progress print-out.
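If you want to confirm which columns those index values drop, a quick check like the one below works. It assumes the "train" data frame from Part 1 kept the original Kaggle column order (datetime, season, holiday, workingday, weather, temp, atemp, humidity, windspeed, casual, registered, count); adjust the indices if your Part 1 data frame is laid out differently.

# check which columns the index vector removes (assumes the original Kaggle column order)
names(train)[c(1, 9, 10, 11)]   # columns dropped from the predictors
names(train)[-c(1, 9, 10, 11)]  # columns actually passed to gbm()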
The progress output appears in the R console as soon as the model starts training: the first 10 iterations, then every 20 thereafter. It shows the training deviance, the validation deviance (nan here because we did not hold out a validation set - hence train.fraction = 1), the step size (shrinkage), and the improvement at each reported iteration.
Next we find the 'best' iteration out of the 1,200 using the "gbm.perf()" function, specifying cross-validation as the "method". It returns the best iteration number and a graph marking where that iteration falls on the squared error loss curve (Figure 1). Then we pass the best iteration to "pretty.gbm.tree()" and print the tree fitted at that iteration, shown below. Additionally, we can call "summary()" on the gbm object with the best iteration, which prints and plots the relative variable influence (Figure 2).
Best Iteration Code:
> # best iteration
> # cv
> best.iter <- gbm.perf(genmod, method="cv") ## the best iteration number
> print(pretty.gbm.tree(genmod, best.iter))
  SplitVar SplitCodePred LeftNode RightNode MissingNode ErrorReduction Weight  Prediction
0        6   81.50000000        1         8           9       75714.60   5443 -0.05284263
1        4   33.21000000        2         6           7       60396.81   4392 -0.14406699
2        4   26.65000000        3         4           5       98849.76   4191 -0.10346147
3       -1   -0.23966561       -1        -1          -1           0.00   3188 -0.23966561
4       -1    0.32945856       -1        -1          -1           0.00   1003  0.32945856
5       -1   -0.10346147       -1        -1          -1           0.00   4191 -0.10346147
6       -1   -0.99072232       -1        -1          -1           0.00    201 -0.99072232
7       -1   -0.14406699       -1        -1          -1           0.00   4392 -0.14406699
8       -1    0.32837278       -1        -1          -1           0.00   1051  0.32837278
9       -1   -0.05284263       -1        -1          -1           0.00   5443 -0.05284263
>
> summary(genmod, n.trees=best.iter)
                  var    rel.inf
humidity     humidity 34.2059372
atemp           atemp 26.0106434
temp             temp 22.6269229
season         season 10.9486469
workingday workingday  3.5587532
weather       weather  2.1290432
holiday       holiday  0.5200532
Best Iteration Graphs:
Fig. 1: gbm.perf() plot of squared error loss by iteration
We can see the best iteration in Figure 1 as the vertical dotted blue line, and yes it is close to 1,200. In fact, the exact iteration selected was 1,199. Also note the dramatic initial decrease in squared error loss in the first 100 iterations.
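If you would rather read the exact number off the fitted object than off the plot, the cross-validation error that "gbm.perf()" uses is stored in the gbm object. A quick sketch, using the "genmod" and "best.iter" objects from the code above:

# the CV-selected iteration is the one with the lowest cross-validation error
which.min(genmod$cv.error)  # should agree with best.iter (1,199 here)
min(genmod$cv.error)        # the squared error loss at that iteration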
Fig. 2: Summary gbm
Then we have the relative influence of the variables in the gbm model, plotted above. It shows humidity (34.2%), 'feels like' temperature (26.0%), and temperature (22.6%) as the top three variables.
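If you only want the influence table and not the plot, "summary()" on a gbm object takes a "plotit" argument and returns the table as a data frame:

# relative influence table without the plot
infl <- summary(genmod, n.trees=best.iter, plotit=FALSE)
infl  # humidity, atemp, and temp lead, matching Figure 2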
Evaluation: RMSLE
Now we have our gbm model and its best iteration. To evaluate the RMSLE of the gbm model, we go one step further: instead of predicting from a random sample of the training data, we use the "test" data to predict the "count" target and submit the result to Kaggle.
If you have not read in the "test" data, do so now, and call "head()" on it to see which variables it contains. This tells us which variables to subset out before calling "predict()", since our gbm model does not use "datetime" or "windspeed". So in the "test" set, columns 1 and 9 are removed. Looking at the predictions, we observe negative values, which are not valid: we cannot have a negative bike sharing "count" for a particular hour (row). We fix this by replacing all negative predictions with zero.
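Before submitting, we can also run a rough local check with the "rmsle()" helper sketched earlier, scoring the model on the training data itself. This is an optimistic estimate, since the model has already seen these rows, but it catches gross problems. A sketch, using the same column indices as the model call:

# optimistic in-sample check of the RMSLE (the model has already seen these rows)
pred.train <- predict(genmod, train[,-c(1,9,10,11)], best.iter, type="response")
pred.train[pred.train < 0] <- 0  # clamp negative counts, as with the test set
rmsle(pred.train, train$count)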
Creating Results Code:
> # predict test data ####
> # use best iteration from cv
> pred.test <- predict(genmod, test[,-c(1,9)], best.iter, type="response")
> summary(pred.test)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 -22.93   98.22  167.70  216.00  315.00  639.10
> pred.test[pred.test<0] <- 0
> # create output file
> output <- data.frame(datetime=test[,1], count=pred.test)
> write.csv(output, file="results.csv", quote=FALSE, row.names=FALSE)
>
Then we create a data.frame called "output" (you can name it whatever you like) with two variables: the "datetime" and the predicted "count"s. For Kaggle to accept the submission, we write it out as a comma-separated file, or CSV, with "write.csv()". We set the "quote" parameter to FALSE so it does not place quotation marks around factors or strings, and "row.names" to FALSE because the Kaggle submission format only has two columns. I named the file "results.csv".
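A quick way to confirm that the file has the expected two-column layout before uploading:

# inspect the submission data frame and the first lines of the written file
head(output, 3)
readLines("results.csv", n=4)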
The "results" CSV file looks like this when viewed in Notepad:
Submission Ready Results
See the two columns of "datetime" and "count", separated by commas? Just what Kaggle prefers. Now log in to Kaggle and find the submission page on the Dashboard to the left. Submit it, and as you can see below, the submitted gbm model's RMSLE was less than Kaggle's Mean Value Benchmark (1.38 vs 1.58)!
Kaggle Leaderboard
Fantastic result. Our gbm model clearly performed better than the linear regression model in Part 2, which had a double-digit RMSLE score. However, our work is not finished yet: the gbm parameters can be tweaked, and our variables can be selected and transformed in different ways to improve the RMSLE score. One possibility is sketched below.
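As one example of the kind of transformation that can help (just a sketch of an idea, not the tuned model from the next post): fitting the gbm on log(count + 1) makes the gaussian squared error loss line up more closely with the RMSLE metric, and back-transforming the predictions avoids negative counts entirely.

# sketch: fit on the log scale so the squared error loss approximates RMSLE
genmod.log <- gbm(log1p(train$count)~.
                  ,data=train[,-c(1,9,10,11)]
                  ,distribution="gaussian"
                  ,n.trees=1200
                  ,shrinkage=0.05
                  ,interaction.depth=3
                  ,cv.folds=10)
best.iter.log <- gbm.perf(genmod.log, method="cv")
pred.log <- predict(genmod.log, test[,-c(1,9)], best.iter.log, type="response")
pred.count <- pmax(expm1(pred.log), 0)  # back to the count scale; clamp any negatives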
As you can guess, that is for the next post, where we optimize our gbm model! So stay tuned for more R analysis in Part 4.
Thanks for reading,
Wayne
@beyondvalence
Capital Bikeshare Series:
1. Predicting Capital Bikeshare Demand in R: Part 1. Data Exploration
2. Predicting Capital Bikeshare Demand in R: Part 2. Regression
3. Predicting Capital Bikeshare Demand in R: Part 3. Generalized Boosted Model
Comments:

Hi, thanks for the post. Is it possible to explain this part, as I am not able to generate this file: load("train.rdata")? Or could you mail it to me at raj4u4all@gmail.com? Thanks in advance.
Did you ever figure out how to generate the train.rdata file?
save(train, file="train.rdata")
Hello, for the gbm model, how much time is required to build it? I waited for nearly two hours but the model did not finish building. Why is this happening? Please help. Thanks in advance.