
Thursday, April 17, 2014

R: Comparing Multiple and Neural Network Regression


Hello Readers,


Today we have a special competition between linear and neural network regression. Which will model diamond data best? Load the ggplot2, RSNNS, MASS, and caret packages, and let us turn R into a diamond expert. While the statistical battle begins, we can learn how diamonds are priced.


READY TO RUMBLE: Diamond Data


Here we will compare and evaluate the results from multiple regression and a neural network on the diamonds data set from the ggplot2 package in R. Consisting of 53,940 observations with 10 variables, diamonds contains data on the carat, cut, color, clarity, price, and diamond dimensions. These variables have a particular effect on price, and we would like to see if they can predict the price of various diamonds.


diamonds Data
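The original screenshot is not reproduced here, so the calls below are an assumption about how the data was inspected. As a minimal sketch, they confirm the size of the data set and show which columns are factors:

```r
library(ggplot2)  # ships the diamonds data set

data(diamonds)
dim(diamonds)   # number of rows and columns
str(diamonds)   # variable types; cut, color, and clarity are ordered factors

# Names of the factor columns
names(diamonds)[sapply(diamonds, is.factor)]
```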

The cut, color, and clarity variables are factors, and must be treated as dummy variables in multiple and neural network regressions. Let us start with multiple regression.



TEAM: Multiple Regression


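First we ready TEAM: Multiple Regression by sampling the rows to randomize the observations, and then create a sample index of 0's and 1's to separate the training and test sets. Note that we remove the depth and table columns (5, 6). Depth is computed from the dimensions x, y, and z (per the ggplot2 documentation, depth = 2 * z / (x + y), which is a ratio rather than a strict linear combination), so it adds little beyond the dimensions themselves; table, the width of the top of the diamond relative to its widest point, is dropped along with it. The training and test sets hold approximately 70% and 30% of the total observations, matching the probabilities we set when sampling the index.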


Train and Test Sets
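The split described above might look like the following sketch. The seed and the object names (`diam`, `ind`, `diam.train`, `diam.test`) are assumptions for illustration, not the original code:

```r
library(ggplot2)  # provides the diamonds data set

set.seed(1234)  # assumed seed, only for reproducibility

# Shuffle the rows and drop depth (column 5) and table (column 6)
diam <- as.data.frame(diamonds[sample(nrow(diamonds)), -c(5, 6)])

# Index of 0's and 1's drawn with probabilities 0.7 and 0.3
ind <- sample(c(0, 1), nrow(diam), replace = TRUE, prob = c(0.7, 0.3))
diam.train <- diam[ind == 0, ]
diam.test  <- diam[ind == 1, ]

# Proportions should come out close to 0.7 and 0.3
round(prop.table(table(ind)), 3)
```

Because the index is drawn at random, the proportions are only approximately 70% and 30%.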

Now we move into the next stage with multiple regression via the train() function from the caret library, instead of the regular lm() function. We specify the predictors, the response variable (price), the "lm" method, and 10-fold cross-validation as the resampling method.


Regression
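A hedged sketch of such a call follows. The formula interface, the seed, and the object names are assumptions; only the "lm" method and the 10-fold cross-validation come from the post:

```r
library(ggplot2)  # diamonds data
library(caret)    # train(), trainControl()

set.seed(1234)  # assumed seed
diam <- as.data.frame(diamonds[sample(nrow(diamonds)), -c(5, 6)])
ind <- sample(c(0, 1), nrow(diam), replace = TRUE, prob = c(0.7, 0.3))
diam.train <- diam[ind == 0, ]

# Linear regression of price on all remaining columns, 10-fold cross-validation
fit.lm <- train(price ~ ., data = diam.train, method = "lm",
                trControl = trainControl(method = "cv", number = 10))

fit.lm               # resampling summary, including RMSE and R-squared
fit.lm$results$RMSE  # cross-validated RMSE
```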

When we call the train(ed) object, we can see the attributes of the training set, the resampling method, the sample sizes, and the results. Note the root mean square error (RMSE) of about 1150. Will that be low enough to take down heavyweight TEAM: Neural Network? Below we visualize the training diamond prices and the predicted prices with ggplot().


Plotting Training and Predicted Prices
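One way to draw such a plot is sketched below. For brevity it fits a plain lm() on the shuffled data as a stand-in for the caret model, and the column names `observed` and `predicted` are made up for the example:

```r
library(ggplot2)

set.seed(1234)  # assumed seed
diam <- as.data.frame(diamonds[sample(nrow(diamonds)), -c(5, 6)])

# Stand-in model: ordinary lm() instead of the caret train() object
fit <- lm(price ~ ., data = diam)
plot.df <- data.frame(observed = diam$price, predicted = predict(fit))

# Observed vs predicted prices, with the y = x line for reference
p <- ggplot(plot.df, aes(x = observed, y = predicted)) +
  geom_point(alpha = 0.1) +
  geom_abline(intercept = 0, slope = 1, colour = "red") +
  labs(x = "Observed price", y = "Predicted price")
print(p)

# Count of predictions below zero, which no real price can be
sum(plot.df$predicted < 0)
```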

From the axes we see that some predicted prices are much higher than the actual prices. There are also predicted prices below 0, which is impossible for an observed price, and that sets TEAM: Multiple Regression back a few points.



Next we use ggplot() again to visualize the predicted and observed diamond prices from the test data, which was not used to train the linear regression model.


Plotting Predicted and Observed from Test Set

Similar to the training prices plot, we see here in the test prices that the model overpredicts the larger values and also predicts negative prices. For TEAM: Multiple Regression to win, TEAM: Neural Network would have to produce even wilder predictions.




Lastly, we calculate the root mean square error by taking the square root of the mean of the squared differences between the predicted and observed diamond prices. The resulting RMSE is 1110.843, similar to the RMSE of the training set.



Linear Regression RMSE, Test Set
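The calculation can be written as a small helper. The function name `rmse` and the toy numbers are illustrative only, not values from the diamonds model:

```r
# Root mean square error: square root of the mean squared difference
rmse <- function(predicted, observed) {
  sqrt(mean((predicted - observed)^2))
}

# Toy example: the errors are 10, -10, and 10, so the RMSE is 10
rmse(c(110, 190, 310), c(100, 200, 300))
```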

Below is a detailed output of the model summary, with the coefficients and residuals. Observe that carat is the best predictor, with the highest t value at 191.7: holding all other variables equal, each 1 carat increase results in a 10,873 dollar increase in price. Looking at the factor variables, we do not see the coefficients rise reliably as the level increases.


Summary Output, Linear Regression
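One possible reason the factor coefficients do not rise level by level, assuming cut, color, and clarity went into the model as ordered factors under R's default contrast options: R then encodes them with polynomial contrasts (linear, quadratic, cubic, and so on) rather than one dummy variable per level. A small sketch using the five cut levels:

```r
# The five cut levels of the diamonds data, as an ordered factor
cut.levels <- c("Fair", "Good", "Very Good", "Premium", "Ideal")
cut <- factor(cut.levels, levels = cut.levels, ordered = TRUE)

# Ordered factors get polynomial contrasts by default
contrasts(cut)
colnames(contrasts(cut))  # ".L" ".Q" ".C" "^4"

# The same levels as an unordered factor get one dummy column
# per non-baseline level instead
contrasts(factor(cut, ordered = FALSE))
```

Under that assumption, a coefficient such as the linear term describes a trend across all levels, not the effect of a single level.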

Now we move on to the neural network regression.



TEAM: Neural Network


TEAM: Neural Network must ready itself as well. Because the activation functions in a neural network output values in a bounded range, such as 0 to 1 or -1 to 1, we must first normalize the price variable to 0 to 1, making the lowest value 0 and the highest value 1. We accomplish this using the normalizeData() function. Save the normalized price output, which carries the normalization parameters, in order to revert the normalization after training. Also, we take the factor variables and turn them into numeric labels using toNumericClassLabels(). Below we see the normalized prices before they are split into a training and test set with the splitForTrainingAndTest() function.


Numeric Labels, Normalized Prices, and Data Splitting
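These preparation steps could be sketched as follows. The seed, the object names, and the test ratio of 0.3 are assumptions; the post does not state the ratio used:

```r
library(ggplot2)  # diamonds data
library(RSNNS)    # normalizeData(), toNumericClassLabels(), splitForTrainingAndTest()

set.seed(1234)  # assumed seed
diam <- as.data.frame(diamonds[sample(nrow(diamonds)), -c(5, 6)])

# Turn the three factor columns into numeric labels
diam$cut     <- toNumericClassLabels(diam$cut)
diam$color   <- toNumericClassLabels(diam$color)
diam$clarity <- toNumericClassLabels(diam$clarity)

# Scale price to the range 0 to 1 and keep the parameters for later
price.norm  <- normalizeData(diam$price, type = "0_1")
norm.params <- getNormParameters(price.norm)

# Predictors are every column except price
inputs <- diam[, names(diam) != "price"]

# splitForTrainingAndTest() takes the last rows as the test set,
# which is why the rows were shuffled first
split <- splitForTrainingAndTest(inputs, price.norm, ratio = 0.3)
range(price.norm)
```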

Now TEAM: Neural Network is ready for the multi-layer perceptron (MLP) regression. We define the training inputs (predictor variables) and targets (prices), the size of the hidden layer (5 units), the learning function parameter (0.1), the maximum number of iterations (100 epochs), and also the test inputs and targets.



Multi-Layer Perceptron Regression
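A sketch of an mlp() call with those settings is shown below. To keep the run short it trains on a 5,000-row subsample, which is an assumption of this example and not what the post did; the seed, object names, and split ratio are likewise illustrative:

```r
library(ggplot2)  # diamonds data
library(RSNNS)    # mlp() and the preprocessing helpers

set.seed(1234)  # assumed seed
# Subsample of 5,000 shuffled rows, only to keep this sketch fast
diam <- as.data.frame(diamonds[sample(nrow(diamonds), 5000), -c(5, 6)])

# Factors to numeric labels, price scaled to 0..1
for (col in c("cut", "color", "clarity")) {
  diam[[col]] <- toNumericClassLabels(diam[[col]])
}
price.norm <- normalizeData(diam$price, type = "0_1")
inputs <- diam[, names(diam) != "price"]
split <- splitForTrainingAndTest(inputs, price.norm, ratio = 0.3)

# One hidden layer of 5 units, learning parameter 0.1, 100 epochs
model <- mlp(split$inputsTrain, split$targetsTrain,
             size = 5, learnFuncParams = c(0.1), maxit = 100,
             inputsTest = split$inputsTest,
             targetsTest = split$targetsTest)

# Fitted values are still on the normalized 0..1 scale
head(model$fitted.values)
```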

If you spectators have dealt with mlp() before, you know the summary output can be quite lengthy, so it is omitted (we dislike commercials too). We move to the visual description of the MLP model with the iterative sum of squared errors for the training and test sets. Additionally, we plot the regression error (predicted vs observed) for the training and test prices.


Plotting Model Summaries
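The RSNNS plotting helpers can be sketched on a small synthetic data set, used here only to keep the example short. The data, seed, and object names are made up; the three plot calls are the point:

```r
library(RSNNS)

set.seed(1)  # assumed seed
# Synthetic regression problem: two inputs, one target scaled to 0..1
x <- matrix(runif(400), ncol = 2)
y <- normalizeData(x[, 1] + x[, 2]^2, type = "0_1")
split <- splitForTrainingAndTest(x, y, ratio = 0.3)

model <- mlp(split$inputsTrain, split$targetsTrain,
             size = 5, learnFuncParams = c(0.1), maxit = 100,
             inputsTest = split$inputsTest,
             targetsTest = split$targetsTest)

# Sum of squared errors per epoch; the test error is drawn in red
plotIterativeError(model)

# Targets vs fitted values for the training set, then the test set;
# the black line is y = x and the red line is a linear fit
plotRegressionError(split$targetsTrain, model$fitted.values)
plotRegressionError(split$targetsTest, model$fittedTestValues)
```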

Time for TEAM: Neural Network to show off its statistical muscles! First up, we have the iterative sum of squared errors (SSE) for each epoch, noting that we specified a maximum of 100 in the MLP model. We see an immediate drop in the SSE within the first few iterations, with the SSE leveling out around 50. The test SSE, in red, fluctuates just above 50 as well. The SSE has plateaued by the end of training and the test SSE stays close to the training SSE, so the model fits well without obvious signs of overfitting. So 100 iterations was a good choice.




Second, we observe the regression plot with the fitted (predicted) and target (observed) prices from the training set. The prices fit reasonably well, and we see the red model regression line close to the black (y=x) optimal line. Note that some middle prices were overpredicted by the model, and there were no negative prices, unlike the linear regression model.




Third, we look at the predicted and observed prices from the test set. Again the red regression line approximates the optimal black line, and more price values were overpredicted by the model. Again, there are no negative predicted prices, a good sign.




Now we calculate the RMSE for the training set, which comes to 692.5155. This looks promising for TEAM: Neural Network!



Calculating RMSE for MLP Training Set

Naturally we want to calculate the RMSE for the test set, but note that in the real world, we would not have the luxury of knowing the real test values. We arrive at 679.5265.


Calculating RMSE for MLP Test Set
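Because the MLP predicts normalized prices between 0 and 1, the predictions have to be converted back to dollars before the RMSE is comparable with the linear model. A minimal sketch with a handful of illustrative prices and made-up prediction errors (none of these numbers come from the fitted model):

```r
library(RSNNS)

# A few illustrative prices, scaled to 0..1
price <- c(326, 554, 2401, 5324, 18823)
price.norm <- normalizeData(price, type = "0_1")
params <- getNormParameters(price.norm)

# Pretend predictions on the normalized scale (made-up offsets)
pred.norm <- price.norm + c(0.01, -0.01, 0.02, 0, -0.02)

# Revert the normalization, then compute the RMSE in dollars
pred <- denormalizeData(pred.norm, params)
obs  <- denormalizeData(price.norm, params)
sqrt(mean((pred - obs)^2))
```

An RMSE computed on the normalized values would be a small fraction such as 0.01, which is not on the same scale as the dollar RMSE of the linear model.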

Which model was better in predicting the diamond price? The linear regression model with 10 fold cross validation, or the multi-layer perceptron model with 5 nodes run to 100 iterations? Who won the rumble?




RUMBLE RESULTS


After calculating the training and test RMSEs for the two TEAMS, we wrap them in a list. We named TEAM: Multiple Regression as linear, and TEAM: Neural Network as neural.



Creating RMSE List

Below we can evaluate the models from their RMSE values. 



All RMSE Values

Looking at the training RMSE first, we see a clear difference: the linear RMSE was 66% larger than the neural RMSE, at 1,152.393 versus 692.5155. Peeking into the test sets, the linear RMSE is similarly 63% larger than the neural RMSE, at 1,110.843 versus 679.5265. TEAM: Neural Network begins to gain the upper hand in the evaluation round.
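The percentages can be checked from the four RMSE values reported in this post. The list layout and the name `rmse.all` are assumptions for illustration:

```r
# RMSE values as reported above, grouped by team
rmse.all <- list(
  linear = c(train = 1152.393, test = 1110.843),
  neural = c(train = 692.5155, test = 679.5265)
)

# How much larger the linear RMSE is, in percent
pct.larger <- round(100 * (rmse.all$linear / rmse.all$neural - 1))
pct.larger  # train 66, test 63
```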


One important difference between the two models was the range of the predictions. Recall from both the training and test plots that the linear regression model predicted negative prices, whereas the MLP model predicted only positive prices. This is a devastating blow to TEAM: Multiple Regression. Overprediction existed in both models, but the linear regression model overpredicted the middle values to levels higher than the anticipated maximum price.

Sometimes the simple models are optimal, and other times more complicated models are better. This time, the neural network model prevailed in predicting diamond prices.


Winner: TEAM: Neural Network


Stay tuned for more analysis, and more rumbles. TEAM: Multiple Regression wants a rematch!

Thanks for reading,

Wayne
@beyondvalence

6 comments:

  2. Great blog! I have a question though, I used your codes for a regression problem using mlp and my rmse is around 0.096 which looks good, but when I tried to see the predicted values, I don't know if I am looking in the right place, it looks way off.
    Is test.obs the predicted values of the test data? thanks mate

    1. Did you unnormalize those predicted values? The output from MLP are values 0 to 1. Hope that helps.

  3. For a multiple linear regression problem (my Y is sales).I have Training set and Testing set. As per above example had used the Training set & build the model after normalizing the output.

    After model building how to predict with my Testing data set
