
Thursday, July 10, 2014

Predicting Capital Bikeshare Demand in R: Part 3. Generalized Boosted Model


Hello Readers,

Today in Part 3, we turn to a more robust method to predict bike sharing demand: generalized boosted model regression. Last time in Part 2, we began running a linear regression to create an initial prediction model to examine the strength of the predictors. To read about the bike sharing data from the Kaggle Knowledge Competition, click here for Part 1.


We also saw how the root mean squared logarithmic error (RMSLE) evaluated predicted "count" values that were lower or higher than the actual "count" value. So here we will explore how we can improve the RMSLE with a generalized boosted regression model of the bike sharing data.

Let's hop right into R.


Generalized Boosted Regression


Here we use GBM through the R library "gbm" to run the models on the Kaggle bike sharing data. Recall that we had to modify and transform some variables into the proper formats and factor levels, which was covered in Part 1. Then we pass the training variables and the training target, "count", to the "gbm()" function, along with the other parameters shown below.

gbm Code:

> # load gbm library
> library(gbm)
> # make sure to set working directory
> 
> # load training and test data
> load("train.rdata")
> test <- read.csv("test.csv")
> 
> # gbm -base model ####
> genmod<-gbm(train$count~.
+             ,data=train[,-c(1,9,10,11)] ## drop datetime, windspeed, casual, registered
+             ,var.monotone=NULL # which vars go up or down with target
+             ,distribution="gaussian"
+             ,n.trees=1200
+             ,shrinkage=0.05
+             ,interaction.depth=3
+             ,bag.fraction = 0.5
+             ,train.fraction = 1
+             ,n.minobsinnode = 10
+             ,cv.folds = 10
+             ,keep.data=TRUE
+             ,verbose=TRUE)

Iter   TrainDeviance   ValidDeviance   StepSize   Improve
     1    32158.0399             nan     0.0500  644.2250
     2    31578.0942             nan     0.0500  585.4711
     3    31044.4208             nan     0.0500  524.7365
     4    30561.4195             nan     0.0500  478.6728
     5    30154.5583             nan     0.0500  413.1806
     6    29746.5141             nan     0.0500  395.9453
     7    29373.1125             nan     0.0500  359.5060
     8    29014.6034             nan     0.0500  343.1201
     9    28703.5502             nan     0.0500  312.3433
    10    28399.9495             nan     0.0500  282.9091
    20    25870.2205             nan     0.0500  175.8114
    40    23836.1755             nan     0.0500   52.8371
    60    23063.7262             nan     0.0500   26.2731
    80    22693.0915             nan     0.0500    4.3578
   100    22450.1503             nan     0.0500   15.8378
   120    22272.5560             nan     0.0500    2.6863
   140    22122.7034             nan     0.0500    2.3877
   160    21992.0298             nan     0.0500   12.2305
   180    21878.9274             nan     0.0500    3.8828
   200    21790.5052             nan     0.0500    1.1173
   220    21692.7944             nan     0.0500   -0.5421
   240    21619.6496             nan     0.0500   -0.3796
   260    21549.5344             nan     0.0500    0.4666
   280    21490.1193             nan     0.0500    1.0782
   300    21435.0103             nan     0.0500   -3.0478
   320    21355.7273             nan     0.0500    2.5832
   340    21308.1967             nan     0.0500   -2.3362
   360    21265.8199             nan     0.0500   -2.0662
   380    21226.2996             nan     0.0500   -3.2661
   400    21187.4927             nan     0.0500   -2.6117
   420    21145.4325             nan     0.0500   -2.0814
   440    21116.4270             nan     0.0500   -2.1384
   460    21083.5071             nan     0.0500   -0.6912
   480    21047.1991             nan     0.0500   -2.5004
   500    21025.6535             nan     0.0500   -5.4226
   520    20998.4169             nan     0.0500   -1.7574
   540    20970.5290             nan     0.0500   -4.4251
   560    20945.7996             nan     0.0500   -1.5079
   580    20914.3748             nan     0.0500   -2.1149
   600    20885.0842             nan     0.0500   -4.0596
   620    20861.5043             nan     0.0500   -3.0287
   640    20834.8791             nan     0.0500   -3.7761
   660    20802.3307             nan     0.0500   -0.5195
   680    20779.3478             nan     0.0500   -8.1288
   700    20751.2605             nan     0.0500   -1.6472
   720    20725.2298             nan     0.0500   -3.7279
   740    20701.8625             nan     0.0500   -5.0591
   760    20683.7046             nan     0.0500   -2.7938
   780    20658.4463             nan     0.0500    0.7247
   800    20637.1474             nan     0.0500   -3.3238
   820    20617.2582             nan     0.0500   -1.8619
   840    20593.8156             nan     0.0500   -2.6460
   860    20569.5343             nan     0.0500   -2.7741
   880    20548.0619             nan     0.0500   -2.5890
   900    20527.9950             nan     0.0500    0.6394
   920    20509.9647             nan     0.0500   -1.7151
   940    20496.3642             nan     0.0500   -2.2759
   960    20478.2300             nan     0.0500   -2.3908
   980    20461.0466             nan     0.0500   -1.1577
  1000    20440.3127             nan     0.0500   -0.7028
  1020    20422.5978             nan     0.0500   -2.6813
  1040    20409.9711             nan     0.0500   -0.9269
  1060    20389.4390             nan     0.0500   -3.7635
  1080    20376.5523             nan     0.0500   -8.8512
  1100    20364.0977             nan     0.0500   -2.5027
  1120    20352.5354             nan     0.0500   -2.6432
  1140    20343.6042             nan     0.0500   -2.0281
  1160    20330.3500             nan     0.0500   -2.9339
  1180    20315.7748             nan     0.0500   -1.6376
  1200    20307.0301             nan     0.0500   -4.4638
> 

Yes, there are many "gbm" parameters to tweak. One thing to note is the set of variables we used to predict "count": in the model we removed the "datetime" and "windspeed" variables (along with "casual" and "registered", which are not available in the test set). I set the number of trees (iterations) to 1,200; the shrinkage, which is the learning rate for each expansion, to 0.05; the interaction depth to 3 for three-way interactions; the bag fraction to half of the training set for each new tree; the minimum number of observations per node to 10; the number of cross-validation folds to 10; and verbose to TRUE for the progress print-out.


The progress output appears in the R console as soon as the model starts running, printing every 20 iterations after the first 10. It shows the training deviance, the validation deviance (which is "nan" here because we did not hold out a separate validation fraction), the step size, and the improvement score.


Next we find the 'best' iteration out of the 1,200 using the "gbm.perf()" function, specifying cross validation as the "method". It returns the best iteration number along with a graph showing where that iteration falls on the squared error loss curve (Figure 1). We then pass the best iteration to "pretty.gbm.tree()" to print that tree, and call "summary()" on the gbm object with the best iteration to print and plot the relative variable influence (Figure 2).


Best Iteration Code:
> # best iteration
> # cv
> best.iter <- gbm.perf(genmod,method="cv") ##the best iteration number
> print(pretty.gbm.tree(genmod, best.iter))
  SplitVar SplitCodePred LeftNode RightNode MissingNode ErrorReduction Weight  Prediction
0        6   81.50000000        1         8           9       75714.60   5443 -0.05284263
1        4   33.21000000        2         6           7       60396.81   4392 -0.14406699
2        4   26.65000000        3         4           5       98849.76   4191 -0.10346147
3       -1   -0.23966561       -1        -1          -1           0.00   3188 -0.23966561
4       -1    0.32945856       -1        -1          -1           0.00   1003  0.32945856
5       -1   -0.10346147       -1        -1          -1           0.00   4191 -0.10346147
6       -1   -0.99072232       -1        -1          -1           0.00    201 -0.99072232
7       -1   -0.14406699       -1        -1          -1           0.00   4392 -0.14406699
8       -1    0.32837278       -1        -1          -1           0.00   1051  0.32837278
9       -1   -0.05284263       -1        -1          -1           0.00   5443 -0.05284263
> 
> summary(genmod, n.trees=best.iter)
                  var    rel.inf
humidity     humidity 34.2059372
atemp           atemp 26.0106434
temp             temp 22.6269229
season         season 10.9486469
workingday workingday  3.5587532
weather       weather  2.1290432
holiday       holiday  0.5200532

Best Iteration Graphs:
Fig. 1: Pretty gbm

We can see the best iteration in Figure 1 as the vertical dotted blue line, and yes it is close to 1,200. In fact, the exact iteration selected was 1,199. Also note the dramatic initial decrease in squared error loss in the first 100 iterations.



Fig. 2: Summary gbm

Then we have the summary plot of the variables' relative influence in the gbm model, above. It shows humidity (34.2%), 'feels like' temperature (26.0%), and temperature (22.6%) as the top three variables.



Evaluation: RMSLE


Now we have our gbm model and the best iteration from it. To evaluate the RMSLE of the gbm model, we go one step further and predict the "count" target on the "test" data and submit it to Kaggle, instead of predicting from a random sample of training data as we did in Part 2.

If you have not read in the "test" data, do so now and call "head()" to see which variables it contains. That way we know which variables to subset out inside "predict()", since our gbm model does not use "datetime" or "windspeed". So in the "test" set, variables 1 and 9 are taken out. We then observe negative values in the predictions, which are not valid, since there cannot be a negative bike sharing "count" for a particular hour (row). We fix this by replacing all negative predictions with zero.


Creating Results Code:

> # predict test data ####
> # use best iteration from cv
> pred.test <- predict(genmod, test[,-c(1,9)], best.iter, type="response")
> summary(pred.test)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 -22.93   98.22  167.70  216.00  315.00  639.10 
> pred.test[pred.test<0] <- 0
> # create output file
> output <- data.frame(datetime=test[,1], count=pred.test)
> write.csv(output, file="results.csv", quote=FALSE, row.names=FALSE)
> 

Then we create a data.frame called "output" (you can name it what you like) with two variables: the "datetime" and the predicted "count" values. For Kaggle to accept the "output", we write it as a comma separated values (CSV) file with "write.csv()". We set the "quote" parameter to FALSE so it does not place quotation marks around factors or strings, and "row.names" to FALSE because the Kaggle submission format only has two columns. I named the file "results.csv".
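Before uploading, it is cheap insurance to read the file back in and confirm it has only the two expected columns; a minimal sketch (assuming the "results.csv" file name used above):

# read the submission file back in and confirm its structure
results.check <- read.csv("results.csv")
str(results.check)             # should show exactly two columns: datetime and count
summary(results.check$count)   # counts should all be non-negative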


The "results" CSV file looks like this when viewed in Notepad:


Submission Ready Results 

See the two columns of "datetime" and "count", separated by commas? Just what Kaggle prefers. Now log in to Kaggle and find the submission page on the Dashboard to the left. Submit it, and as you can see below, the submitted gbm model's RMSLE was less than Kaggle's Mean Value Benchmark (1.38 vs 1.58)!



Kaggle Leaderboard

Fantastic result. Our gbm model clearly performed better than our linear regression model in Part 2, which had a double digit RMSLE score. However, our work is not finished yet. The gbm model parameters can be tweaked, and our variables can be selected and transformed in different ways to improve the RMSLE score; a taste of what that tuning could look like is sketched below.
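As a hedged illustration of that tuning (the grid values here are examples of mine, not the settings used later), one could refit the gbm over a few shrinkage and interaction depth combinations and compare their cross-validation errors:

# illustrative tuning loop over shrinkage and interaction depth
# (parameter values are examples only; each fit can take a while)
grid <- expand.grid(shrinkage=c(0.01, 0.05, 0.1), depth=c(3, 5))
grid$cv.error <- NA
for (i in seq_len(nrow(grid))) {
  fit <- gbm(train$count~., data=train[,-c(1,9,10,11)],
             distribution="gaussian", n.trees=1200,
             shrinkage=grid$shrinkage[i],
             interaction.depth=grid$depth[i],
             n.minobsinnode=10, bag.fraction=0.5,
             cv.folds=5, verbose=FALSE)
  grid$cv.error[i] <- min(fit$cv.error)   # best CV error over the 1,200 trees
}
grid   # pick the row with the lowest cv.error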


As you can guess, that is for the next post, where we optimize our gbm model! So stay tuned for more R analysis.


Thanks for reading,

Wayne
@beyondvalence
LinkedIn

Capital Bikeshare Series:
1. Predicting Capital Bikeshare Demand in R: Part 1. Data Exploration
2. Predicting Capital Bikeshare Demand in R: Part 2. Regression
3. Predicting Capital Bikeshare Demand in R: Part 3. Generalized Boosted Model

Wednesday, July 2, 2014

Predicting Capital Bikeshare Demand in R: Part 2. Regression


Hello Readers,

Today we continue our Kaggle Knowledge Discovery series of predicting bike sharing demand from Washington D.C.'s Capital Bikeshare program. Since we explored the data, and visually stratified our target "count" variable in Part 1, here we progress by generating a predictive model.



This R post is a direct continuation from the previous Part 1.


Linear Regression


Since this is the first time we will model the "count" target, we will use basic regression and all the covariates as predictors to get an idea of how well the variables predict the "count". In Part 1, we did see (weak) visual correlations with "season", "weather", and "temperature". We will remember to examine their coefficients in the regression model summary. I ran the regression using the code on our train dataset below:

Linear Regression Code & Output:
> names(train)
 [1] "datetime"   "season"     "holiday"    "workingday" "weather"   
 [6] "temp"       "atemp"      "humidity"   "windspeed"  "casual"    
[11] "registered" "count" 

> # running linear regression model
> # exclude casual and registered
> train.lm <- lm(count ~ ., data=train[,-c(10,11)])
> summary(train.lm)

Call:
lm(formula = count ~ ., data = train[, -c(10, 11)])

Residuals:
    Min      1Q  Median      3Q     Max 
-326.85  -98.03  -24.58   71.78  656.91 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -3.066e+03  1.253e+02 -24.476  < 2e-16 ***
datetime     2.454e-06  9.043e-08  27.136  < 2e-16 ***
season2     -1.297e+01  5.238e+00  -2.476  0.01332 *  
season3     -6.191e+01  6.742e+00  -9.183  < 2e-16 ***
season4      1.037e+01  4.837e+00   2.145  0.03198 *  
holiday1    -8.174e+00  8.882e+00  -0.920  0.35740    
workingday1 -2.138e+00  3.177e+00  -0.673  0.50108    
weather.L   -9.431e+01  1.001e+02  -0.942  0.34631    
weather.Q    7.285e+01  7.467e+01   0.976  0.32926    
weather.C   -4.619e+01  3.360e+01  -1.375  0.16924    
temp         6.706e+00  1.168e+00   5.741 9.67e-09 ***
atemp        3.291e+00  1.024e+00   3.213  0.00132 ** 
humidity    -2.673e+00  9.068e-02 -29.472  < 2e-16 ***
windspeed    8.439e-01  1.927e-01   4.380 1.20e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 149.2 on 10872 degrees of freedom
Multiple R-squared:  0.3226, Adjusted R-squared:  0.3218 
F-statistic: 398.3 on 13 and 10872 DF,  p-value: < 2.2e-16

Make sure to exclude the user demand variables "casual" and "registered", because they sum to the total "count" demand and are not available in the test set; otherwise the 'prediction' would simply be the sum of those two variables.


Looking at the output, we see a wide range of residuals (-326.85 to 656.91) for the target variable, which is not a good sign. In the coefficients section, we see the factor variables transformed into dummy variables, relative to their first levels. Comparing with the graphical correlations from Part 1, the model tells us that "datetime", "season3" (relative to "season1"), "temp", "humidity", and "windspeed" predict the target "count" better than the other variables.

They do not predict it all that well, however: the R-squared was only 0.3226, which means the linear model accounts for roughly 32% of the variation in "count". But do not worry, this is only a preliminary model; we will use more sophisticated techniques in Part 3. This basic regression is simply a benchmark, so we can see how the variables perform and how much future models improve (or regress) in predictive performance.


Evaluation Metric- RMSLE


Submissions to Kaggle require us to submit the predicted "count" variable along with the "datetime" stamp. But how do they score the submissions to create the leaderboards? In other words, how do they determine whose model performed better?

This is where they use a metric to score the prediction submissions with the actual counts. Sometimes we use mean squared error, sometimes we use ROC curves, other times we use something else, like the root mean squared logarithmic error (RMSLE).

The RMSLE equation is evaluated like so:

\[ \sqrt{\frac{1}{n}\sum_{i=1}^{n}\bigl(\log(p_i+1)-\log(a_i+1)\bigr)^2} \]

with n = number of hours in test set, p = predicted count, a = actual count, and log(x) is the natural log (ln). The RMSLE penalizes extreme prediction values, so the RMSLE will be higher the farther the predicted values are from the actual. Let's see how this works visually:


Understanding RMSLE Code:
> # RMSLE-like values for various predictions
> p <- seq(1,100, by=1)
> a <- rep(40,100)
> rmsle <- ((log(p+1)-log(a+1))^2)^0.5
> plot(p,rmsle, type="l", main="Basic RMSLE Pattern", xlab="Predicted")


As you can see, a prediction equal to the actual value of 40 yields an RMSLE of 0. The curve rises faster for predictions below the actual value than for predictions above it: the log difference can be written as log((p+1)/(a+1)), and as the prediction 'p' approaches 0 the log of that ratio heads toward negative infinity (positive after squaring and taking the root), so under-predictions are penalized more heavily. Either way, the RMSLE grows as the gap between predicted and actual values grows.
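Since we will need this metric again below, a small helper function implementing the formula above (square each log difference, average, then take the square root) is handy; this is a minimal sketch of mine, not Kaggle's scoring code:

# RMSLE helper: per-hour log differences, squared, averaged, then square-rooted
rmsle <- function(predicted, actual) {
  sqrt(mean((log(predicted + 1) - log(actual + 1))^2))
}
# example: a perfect prediction gives 0
# rmsle(rep(40, 100), rep(40, 100))   # returns 0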


Model Evaluation


We can sample 1000 random hours as our test set, so that we can calculate the RMSLE with the predicted and actual counts.

1000 Sample Predictions & RMSLE Code:
> # total hours in training set
> nrow(train)
[1] 10886
> 
> # predict using 1000 samples
> i.test <- sample(1:nrow(train), 1000)
> # create test variables
> test.1 <- train[i.test,1:9]
> # create test target variable
> test.1.target <- train[i.test,12]
> # predict $count using test variables
> test.1.pred <- predict(train.lm, newdata=test.1)
> 
> summary(test.1.pred) # has negatives
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 -104.1   110.8   186.9   186.1   258.0   503.1 
> 
> # n, number of test samples
> n <- 1000
> # eliminate negative predictions
> test.1.pred[test.1.pred<0] <- 0
> 
> summary(test.1.pred)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    0.0   110.8   186.9   187.3   258.0   503.1 
> 
> 
> # root mean squared log error (RMSLE)
> # [(1/n)*sum((log(p+1)-log(a+1))^2)]^0.5
> # n = number of hours
> # p = predicted count; a = actual count
> # log = ln
> 
> (test.1.rmsle <- ((1/n)*sum(log(test.1.pred+1)-log(test.1.target+1))^2)^0.5)
[1] 11.80803
> 

To create our 'test' set, we sample 1,000 random indexes from the 10,886 rows (each an hour), and use the index to pull out 1,000 rows of predictor variables and the corresponding 1,000 target values. Then we use predict() to generate "count" predictions from our linear regression model. However, the summary of the predicted values shows (sadly) that there are negative values, which fall outside the valid range of "count". So we simply assume that the predicted bike demand during those hours is zero.

Then we run the RMSLE with the modified predicted values and the actual "count" values, and arrive at 11.80803. Looking at the leaderboard, which ranks submissions by RMSLE, we see that 11.8 is very high. The leading models have RMSLE values as low as 0.295, and Kaggle's Mean Value Benchmark is 1.58456. So quite frankly, this model is no good, which was to be expected.

To fix this, we will run a generalized boosted regression model (gbm) to predict the count in Part 3. So stay tuned for more R!



Thanks for reading,

Wayne
@beyondvalence
LinkedIn

Capital Bikeshare Series:
1. Predicting Capital Bikeshare Demand in R: Part 1. Data Exploration
2. Predicting Capital Bikeshare Demand in R: Part 2. Regression
3. Predicting Capital Bikeshare Demand in R: Part 3. Generalized Boosted Model

Thursday, April 17, 2014

R: Comparing Multiple and Neural Network Regression


Hello Readers,


Today we have a special competition between linear and neural network regression. Which will model diamond data best? Load the ggplot2, RSNNS, MASS, and caret packages, and let us turn R into a diamond expert. While the statistical battle begins, we can learn how diamonds are priced.


READY TO RUMBLE: Diamond Data


Here we will compare and evaluate the results from multiple regression and a neural network on the diamonds data set from the ggplot2 package in R. Consisting of 53,940 observations with 10 variables, diamonds contains data on the carat, cut, color, clarity, price, and diamond dimensions. These variables have a particular effect on price, and we would like to see if they can predict the price of various diamonds.


diamonds Data
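For readers following along at home, the data set ships with ggplot2, so inspecting it takes only a couple of lines (a minimal sketch):

# load the diamonds data from ggplot2 and look at its structure
library(ggplot2)
data(diamonds)
dim(diamonds)    # 53940 rows, 10 columns
str(diamonds)    # carat, cut, color, clarity, depth, table, price, x, y, z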

The cut, color, and clarity variables are factors, and must be treated as dummy variables in multiple and neural network regressions. Let us start with multiple regression.



TEAM: Multiple Regression


First we ready TEAM: Multiple Regression by sampling the rows to randomize the observations, and then creating a sample index of 0's and 1's to separate the training and test sets. Note that the depth and table columns (5, 6) are removed because they are linear combinations of the dimensions x, y, and z. The training and test sets end up with roughly 70% and 30% of the total observations, matching the sampling probabilities we set.


Train and Test Sets
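A hedged sketch of that split (the object names and seed here are mine; the screenshot above has the original code):

# randomize the rows, drop depth and table (columns 5 and 6),
# then assign roughly 70% of rows to training and 30% to testing
set.seed(123)                                   # illustrative seed
d <- diamonds[sample(nrow(diamonds)), -c(5, 6)]
idx <- sample(0:1, nrow(d), replace=TRUE, prob=c(0.3, 0.7))
d.train <- d[idx == 1, ]
d.test  <- d[idx == 0, ]
nrow(d.train); nrow(d.test)   # roughly 70% / 30% of 53,940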

Now we move into the next stage with multiple regression via the train() function from the caret library, instead of the regular lm() function. We specify the predictors, the response variable (price), the "lm" method, and the cross validation resampling method.


Regression
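A minimal sketch of that call, continuing the names from the split sketch (the screenshot above shows the original):

# linear regression through caret's train(), with 10-fold cross validation
library(caret)
lm.fit <- train(price ~ ., data=d.train,
                method="lm",
                trControl=trainControl(method="cv", number=10))
lm.fit   # prints sample sizes, resampling details, and the training RMSE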

When we call the train(ed) object, we can see the attributes of the training set, resampling, sample sizes, and the results. Note the root mean square error value of 1150. Will that be low enough to take down heavyweight TEAM: Neural Network? Below we visualize the training diamond prices and the predicted prices with ggplot().


Plotting Training and Predicted Prices
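A hedged ggplot() sketch of that plot, again continuing the names from the sketches above:

# scatter the actual training prices against the model's fitted prices
ggplot(data.frame(actual=d.train$price,
                  predicted=predict(lm.fit, newdata=d.train)),
       aes(x=actual, y=predicted)) +
  geom_point(alpha=0.2) +
  labs(title="Training Set: Predicted vs. Actual Diamond Prices",
       x="Actual Price", y="Predicted Price")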

We see from the axes that the predicted prices reach higher values than the actual prices. There are also predicted prices below 0, which is impossible for observed prices, and that will set TEAM: Multiple Regression back a few points.



Next we use ggplot() again to visualize the predicted and observed diamond prices from the test data, which did not train the linear regression model.


Plotting Predicted and Observed from Test Set

Similar to the training prices plot, we see here in the test prices that the model over-predicts the larger values and again produces negative price predictions. For TEAM: Multiple Regression to win, TEAM: Neural Network will have to produce even wilder predictions.




Lastly, we calculate the root mean square error by taking the square root of the mean squared difference between the predicted and observed diamond prices. The resulting RMSE is 1110.843, similar to the RMSE of the training set.



Linear Regression RMSE, Test Set
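The same calculation in code form is one line once the test-set predictions exist; a minimal sketch with the names used above:

# predict test-set prices with the trained linear model and compute RMSE
lm.pred <- predict(lm.fit, newdata=d.test)
sqrt(mean((lm.pred - d.test$price)^2))   # about 1110 in the post's run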

Below is a detailed output of the model summary, with the coefficients and residuals. Observe how carat is the strongest predictor, with the highest t value at 191.7: every one-carat increase, holding all other variables equal, corresponds to a 10,873 dollar increase in price. Looking at the factor variables, we do not see a consistent increase in coefficients as the level value increases.


Summary Output, Linear Regression
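With caret, that detailed output comes from the underlying lm fit; a one-line sketch:

# coefficient table and residual summary for the fitted linear model
summary(lm.fit$finalModel)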

Now we move on to the neural network regression.



TEAM: Neural Network


TEAM: Neural Network must ready itself as well. Because neural networks operate in terms of 0 to 1, or -1 to 1, we must first normalize the price variable to the 0 to 1 range, making the lowest value 0 and the highest value 1. We accomplish this using the normalizeData() function, keeping the normalized price output so we can revert the normalization after training. We also take the factor variables and turn them into numeric labels using toNumericClassLabels(). Below we see the normalized prices before they are split into training and test sets with the splitForTrainingAndTest() function.


Numeric Labels, Normalized Prices, and Data Splitting
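A hedged sketch of that preparation using the RSNNS helpers named above (column handling and the split ratio are assumptions of mine; the screenshot has the original code):

# convert the factor columns to numeric labels, normalize price to [0, 1],
# and split the inputs and targets into training and test portions
library(RSNNS)
d.nn <- d                                             # data from the split section
d.nn$cut     <- toNumericClassLabels(d.nn$cut)
d.nn$color   <- toNumericClassLabels(d.nn$color)
d.nn$clarity <- toNumericClassLabels(d.nn$clarity)
price.norm <- normalizeData(d.nn$price, type="0_1")   # keep this to revert later
inputs <- as.matrix(d.nn[, setdiff(names(d.nn), "price")])
nn.split <- splitForTrainingAndTest(inputs, price.norm, ratio=0.3)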

Now TEAM: Neural Network is ready for the multi-layer perceptron (MLP) regression. We define the training inputs (predictor variables) and targets (prices), the size of the hidden layer (5), the learning parameter (0.1), the maximum number of iterations (100 epochs), and also the test inputs/targets.



Multi-Layer Perceptron Regression
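A minimal sketch of the MLP call with the settings described (5 hidden nodes, learning parameter 0.1, 100 iterations), continuing the names from the preparation sketch:

# multi-layer perceptron regression on the normalized prices
mlp.fit <- mlp(nn.split$inputsTrain, nn.split$targetsTrain,
               size=5, learnFuncParams=c(0.1), maxit=100,
               inputsTest=nn.split$inputsTest,
               targetsTest=nn.split$targetsTest)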

If you spectators have dealt with mlp() before, you know the summary output can be quite lengthy, so it is omitted (we dislike commercials too). We move to the visual description of the MLP model with the iterative sum of square errors for the training and test sets. Additionally, we plot the regression error (predicted vs. observed) for the training and test prices.


Plotting Model Summaries
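A hedged sketch of those plots using RSNNS's built-in helpers, with names continuing from the MLP sketch:

# iterative SSE across the 100 epochs (training in black, test in red)
plotIterativeError(mlp.fit)
# regression error (fitted vs. target prices), training then test
plotRegressionError(nn.split$targetsTrain, mlp.fit$fitted.values)
plotRegressionError(nn.split$targetsTest, mlp.fit$fittedTestValues)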

Time for TEAM: Neural Network to show off its statistical muscles! First up, we have the iterative sum of square error for each epoch, noting that we specified a maximum of 100 in the MLP model. We see an immediate drop in the SSE over the first few iterations, with the SSE leveling out around 50. The test SSE, in red, fluctuates just above 50 as well. Since the SSE began to plateau, the model fit well but not too well, as we want to avoid overfitting. So 100 iterations was a good choice.




Second, we observe the regression plot with the fitted (predicted) and target (observed) prices from the training set. The prices fit reasonably well, and we see the red model regression line close to the black (y=x) optimal line. Note that some middle prices were over predicted by the model, and there were no negative prices, unlike the linear regression model.




Third, we look at the predicted and observed prices from the test set. Again the red regression line approximates the optimal black line, and more price values were over predicted by the model. Again, there are no negative predicted prices, a good sign.




Now we calculate the RMSE for the training set, for which we get 692.5155. This looks promising for TEAM: Neural Network!



Calculating RMSE for MLP Training Set

Naturally we want to calculate the RMSE for the test set, but note that in the real world, we would not have the luxury of knowing the real test values. We arrive at 679.5265.


Calculating RMSE for MLP Test Set
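A minimal sketch of that test-set calculation (the training-set RMSE follows the same recipe): predict on the normalized test inputs, undo the 0-to-1 normalization with the parameters saved earlier, and compare against the denormalized targets. Names continue from the sketches above.

# predict normalized prices for the test inputs, then map back to dollars
mlp.pred.norm <- predict(mlp.fit, nn.split$inputsTest)
norm.params   <- getNormParameters(price.norm)
mlp.pred      <- denormalizeData(mlp.pred.norm, norm.params)
test.actual   <- denormalizeData(nn.split$targetsTest, norm.params)
sqrt(mean((mlp.pred - test.actual)^2))   # about 680 in the post's run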

Which model was better in predicting the diamond price? The linear regression model with 10 fold cross validation, or the multi-layer perceptron model with 5 nodes run to 100 iterations? Who won the rumble?




RUMBLE RESULTS


From the two RMSE's calculated on the training and test sets for the two TEAMs, we wrap the values in a list, naming the TEAM: Multiple Regression values linear and the TEAM: Neural Network values neural.



Creating RMSE List
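A one-liner sketch of that bookkeeping, using the RMSE values reported above:

# collect the four RMSE values for easy comparison
rmse.all <- list(linear=c(train=1152.393, test=1110.843),
                 neural=c(train=692.5155, test=679.5265))
rmse.all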

Below we can evaluate the models from their RMSE values. 



All RMSE Values

Looking at the training RMSE first, we see a clear difference, as the linear RMSE was 66% larger than the neural RMSE, at 1,152.393 versus 692.5155. Peeking into the test sets, the linear RMSE is similarly 63% larger than the neural RMSE, at 1,110.843 and 679.5265 respectively. TEAM: Neural Network begins to gain the upper hand in the evaluation round.


One important difference between the two models was the range of the predictions. Recall from both the training and test plots that the linear regression model predicted negative price values, whereas the MLP model predicted only positive prices. This is a devastating blow to TEAM: Multiple Regression. Both models over-predicted some prices, but the linear regression model also pushed its middle-range predictions above the anticipated maximum price values.

Sometimes the simple models are optimal, and other times more complicated models are better. This time, the neural network model prevailed in predicting diamond prices.


Winner: TEAM: Neural Network


Stay tuned for more analysis, and more rumbles. TEAM: Multiple Regression wants a rematch!

Thanks for reading,

Wayne
@beyondvalence
LinkedIn