
Wednesday, July 2, 2014

Predicting Capital Bikeshare Demand in R: Part 2. Regression


Hello Readers,

Today we continue our Kaggle Knowledge Discovery series on predicting bike sharing demand from Washington D.C.'s Capital Bikeshare program. Since we explored the data and visually stratified our target "count" variable in Part 1, here we move on to building a predictive model.



This R post is a direct continuation of Part 1.


Linear Regression


Since this is the first time we are modeling the "count" target, we will use basic linear regression with all the covariates as predictors, to get an idea of how well the variables predict "count". In Part 1, we did see (weak) visual correlations with "season", "weather", and "temperature", so we will remember to examine their coefficients in the regression model summary. I ran the regression on our train dataset using the code below:

Linear Regression Code & Output:
> names(train)
 [1] "datetime"   "season"     "holiday"    "workingday" "weather"   
 [6] "temp"       "atemp"      "humidity"   "windspeed"  "casual"    
[11] "registered" "count" 

# running the linear regression model,
# excluding "casual" and "registered" (columns 10 and 11)
> train.lm <- lm(count ~ ., data=train[,-c(10,11)])
> summary(train.lm)

Call:
lm(formula = count ~ ., data = train[, -c(10, 11)])

Residuals:
    Min      1Q  Median      3Q     Max 
-326.85  -98.03  -24.58   71.78  656.91 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -3.066e+03  1.253e+02 -24.476  < 2e-16 ***
datetime     2.454e-06  9.043e-08  27.136  < 2e-16 ***
season2     -1.297e+01  5.238e+00  -2.476  0.01332 *  
season3     -6.191e+01  6.742e+00  -9.183  < 2e-16 ***
season4      1.037e+01  4.837e+00   2.145  0.03198 *  
holiday1    -8.174e+00  8.882e+00  -0.920  0.35740    
workingday1 -2.138e+00  3.177e+00  -0.673  0.50108    
weather.L   -9.431e+01  1.001e+02  -0.942  0.34631    
weather.Q    7.285e+01  7.467e+01   0.976  0.32926    
weather.C   -4.619e+01  3.360e+01  -1.375  0.16924    
temp         6.706e+00  1.168e+00   5.741 9.67e-09 ***
atemp        3.291e+00  1.024e+00   3.213  0.00132 ** 
humidity    -2.673e+00  9.068e-02 -29.472  < 2e-16 ***
windspeed    8.439e-01  1.927e-01   4.380 1.20e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 149.2 on 10872 degrees of freedom
Multiple R-squared:  0.3226, Adjusted R-squared:  0.3218 
F-statistic: 398.3 on 13 and 10872 DF,  p-value: < 2.2e-16

Make sure to exclude the rider demand variables "casual" and "registered", because they sum to the total "count" demand and are not available in the test set. Otherwise the 'prediction' would simply be an easy sum of those two variables.
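
As a quick sanity check, we can confirm that "casual" and "registered" really do add up to "count", and drop the two columns by name instead of by position. This is a minimal sketch against the train data frame loaded from Kaggle's train.csv; the object name train.predictors is just an illustrative placeholder.

# sanity check: "casual" and "registered" should sum to "count" for every hour
all(train$casual + train$registered == train$count)

# dropping the two columns by name rather than by position (equivalent to -c(10,11))
train.predictors <- train[, !(names(train) %in% c("casual", "registered"))]

Dropping columns by name guards against the indexes shifting if the data frame ever changes shape.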


Looking at the output, we see a wide range of residuals (-326.85 to 656.91) for the target variable, which is not a good sign. In the coefficients section, we see the factor variables transformed into dummy variables, each relative to the first level. Comparing with the graphical correlations from Part 1, the model tells us that "datetime", "season3" (relative to "season1"), "temp", "humidity", and "windspeed" predict the target "count" better than the other variables.
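
To see exactly which level each dummy is measured against, we can inspect the factor contrasts and pull out the coefficient table sorted by p-value. This is a quick sketch assuming "season" was coded as an unordered factor and "weather" as an ordered factor, as the season2/season3/season4 and weather.L/.Q/.C coefficient names in the output suggest.

# the first level of each factor is the baseline the dummies are measured against
levels(train$season)
contrasts(train$season)    # treatment (dummy) coding for an unordered factor
contrasts(train$weather)   # polynomial coding (.L, .Q, .C) for an ordered factor

# coefficient table from the fitted model, ordered by p-value
coefs <- summary(train.lm)$coefficients
coefs[order(coefs[, "Pr(>|t|)"]), ]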

Not that well, though: the R-squared is only 0.3226, which means the linear model accounts for roughly 32% of the variation in "count". But do not worry, this is only a preliminary model; we will use more sophisticated techniques in Part 3. This basic regression is simply a benchmark, so we can see how the variables are performing and how much future models improve (or fall back) in predictive performance.


Evaluation Metric- RMSLE


Submissions to Kaggle require us to submit the predicted "count" variable along with the "datetime" stamp. But how do they score the submissions to create the leaderboards? In other words, how do they determine whose model performed better?

This is where a metric comes in, scoring the prediction submissions against the actual counts. Sometimes we use mean squared error, sometimes ROC curves, and other times something else, such as the root mean squared logarithmic error (RMSLE).

The RMSLE equation is evaluated like so:

$$\sqrt{\frac{1}{n}\sum_{i=1}^{n}\bigl(\log(p_i+1)-\log(a_i+1)\bigr)^2}$$

where n is the number of hours in the test set, p_i is the predicted count, a_i is the actual count, and log(x) is the natural log (ln). The RMSLE penalizes extreme prediction values: the farther the predicted values are from the actual values, the higher the RMSLE. Let's see how this works visually:


Understanding RMSLE Code:
> # RMSLE-like values for various predictions
> p <- seq(1,100, by=1)
> a <- rep(40,100)
> rmsle <- ((log(p+1)-log(a+1))^2)^0.5
> plot(p,rmsle, type="l", main="Basic RMSLE Pattern", xlab="Predicted")


As you can see, a prediction equal to the actual value of 40 yields an RMSLE of 0. Predictions below the actual value generate higher RMSLE values than predictions the same distance above it. Lower predictions are penalized more because the error term is log(p+1) - log(a+1) = log((p+1)/(a+1)), and as the prediction p shrinks toward 0 the ratio shrinks with it, and the log of a small ratio blows up in magnitude (log(x) heads toward negative infinity as x approaches 0) much faster than the log of a large ratio grows. Either way, the RMSLE generates higher values for larger discrepancies between predicted and actual values.
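
A quick numeric check of that asymmetry, a small sketch using the same actual value of 40 as above: a prediction 30 below the actual contributes a much larger error term than a prediction 30 above it.

# with the actual value fixed at 40, compare per-observation RMSLE terms for
# predictions the same distance below and above the actual
a <- 40
abs(log(10 + 1) - log(a + 1))   # p = 10, 30 below the actual: ~1.32
abs(log(70 + 1) - log(a + 1))   # p = 70, 30 above the actual: ~0.55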


Model Evaluation


We can sample 1000 random hours from the training set as a makeshift test set, so that we can calculate the RMSLE between the predicted and actual counts.

1000 Sample Predictions & RMSLE Code:
# total hours in training set
> nrow(train)
[1] 10886
> 
> # predict using 1000 samples
> i.test <- sample(1:nrow(train), 1000)
> # create test variables
> test.1 <- train[i.test,1:9]
> # create test target variable
> test.1.target <- train[i.test,12]
> # predict $count using test variables
> test.1.pred <- predict(train.lm, newdata=test.1)
> 
> summary(test.1.pred) # has negatives
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 -104.1   110.8   186.9   186.1   258.0   503.1 
> 
> # n, number of test samples
> n <- 1000
> # eliminate negative predictions
> test.1.pred[test.1.pred<0] <- 0
> 
> summary(test.1.pred)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    0.0   110.8   186.9   187.3   258.0   503.1 
> 
> 
> # root mean squared log error (RMSLE)
> # [(1/n)*sum((log(p+1)-log(a+1))^2)]^0.5
> # n = number of hours
> # p = predicted count; a = actual count
> # log = ln
> 
> (test.1.rmsle <- ((1/n)*sum(log(test.1.pred+1)-log(test.1.target+1))^2)^0.5)
[1] 11.80803
> 

To create our 'test' set, we sample 1000 random indexes from the 10,886 rows (each an hour), and use those indexes to isolate 1000 rows of predictor variables and their 1000 actual target values. Then we use
predict() to create predicted "count" values from our linear regression model. However, from the summary of the predicted values, we see (sadly) that there are negative values, which fall outside the valid range of "count". So we simply assume that the predicted bike demand during those hours is zero.

Then we run the RMSLE calculation with the modified predicted values and the actual "count" values, and arrive at 11.80803. Looking at the leaderboard, which displays each entry's RMSLE, we see that an RMSLE of 11.8 is rather high. The leading models have RMSLE values as low as 0.295, and Kaggle's Mean Value Benchmark is 1.58456. So quite frankly, this model is no good, which was to be expected.
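
Note that in the console call above, the ^2 sits outside sum(), so the sum of the log differences gets squared rather than each difference. A small helper that follows the RMSLE formula above exactly, squaring each term before averaging, might look like the sketch below; test.1.pred and test.1.target are the vectors from the transcript, and the function name rmsle is just a placeholder.

# RMSLE following the formula above: square each log difference, then average
rmsle <- function(pred, actual) {
  sqrt(mean((log(pred + 1) - log(actual + 1))^2))
}

# clip negative predictions to zero, then score against the actual counts
rmsle(pmax(test.1.pred, 0), test.1.target)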

To improve on this, we will fit a generalized boosted regression model (gbm) to predict the count in Part 3. So stay tuned for more R!



Thanks for reading,

Wayne
@beyondvalence
LinkedIn

Capital Bikeshare Series:
1. Predicting Capital Bikeshare Demand in R: Part 1. Data Exploration
2. Predicting Capital Bikeshare Demand in R: Part 2. Regression
3. Predicting Capital Bikeshare Demand in R: Part 3. Generalized Boosted Model
