Wednesday, March 12, 2014

R: World Wheat Production Part II, Linear Filtering and Regression Forecasting


Hello Readers,



Today we will continue analyzing the wheat production and harvest area data from the Food and Agriculture Organization of the UN. In the previous post, we imported the data from Data Market and proceeded to plot the production quantity and harvest area time series, reproduced below:


With wheat production increasing more than the harvest area over the years, we also plotted histograms of the lagged differences of the log of wheat production. We noted that the mean difference was, as expected, above zero.

Next we move on to modeling a single component of the time series, the trend, for wheat production in R. We will not focus on the Harvest Area series because it does not fluctuate like the Production Quantity series, as seen above. Let us get started.



Linear Filtering


Now that we have a good grasp on the wheat data, we can model the time series to find a trend. A time series is usually decomposed into multiple components: a trend, a seasonal component, and a remainder component. One method of trend discovery is linear filtering with equally weighted moving averages. Each element of the time series X is filtered over a window of half-width a, such that the elements X[n-a] through X[n+a] are each multiplied by the coefficient 1/(2a+1) and then summed. For example, with a filter of a=2, for an element X[n], the sum of the previous two elements, the next two elements, and X[n] itself is multiplied by 1/5, since 1/(2*2+1) = 1/5.
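To make the arithmetic concrete, here is a minimal sketch of the a=2 filter on a short toy series. The numbers in x are made up for illustration and are not the wheat data; the variable names are also just placeholders.

```r
# Toy series: made-up numbers for illustration, not the wheat data
x <- c(3, 5, 2, 8, 6, 9, 4)

a <- 2                                  # half-width of the window
w <- rep(1 / (2 * a + 1), 2 * a + 1)    # five equal weights of 1/5

# sides = 2 centers the window on each element of the series
ma <- stats::filter(x, filter = w, sides = 2)

# The same moving average computed by hand for the 3rd element:
# two elements before it, the element itself, two elements after it
by_hand <- sum(x[1:5]) / 5

print(as.numeric(ma))   # the first and last `a` positions are NA
print(by_hand)
```

The first and last two positions come back as NA because the window would run past the ends of the series there.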



In R, we model the trend component with the filter() function and specify the coefficients. This allows us to plot the trend for two different moving average windows, a=2 and a=10, over the wheat production quantity time series.

Plotting Wheat Production with Moving Averages
Below, we can observe the moving averages of 2 and 10 overlaying the Production Quantity time series.



Note that the red moving average trend line (a=2) follows the fluctuations of the time series more closely than the green one (a=10). Why? Because the a=10 moving average takes the average of the previous 10 elements, the next 10 elements, and the element itself.


So the green trend line incorporates more values in calculating the moving average (a=10, n is 21; a=2, n is 5) than the red trend line. Therefore the green trend line is smoother and depicts a trend covering a longer time frame than the red trend line, which shows the average over a shorter time frame and thus fluctuates more with the wheat production data.
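Since the original code screenshot is not reproduced here, the following is a rough sketch of how the two trend lines could be drawn. It assumes a simulated yearly series (sim) standing in for the production data, and a small helper, ma_weights(), that is my own and not part of R.

```r
# Simulated stand-in for a yearly production series (NOT the FAO data)
set.seed(42)
sim <- ts(200 + 8 * (0:49) + rnorm(50, sd = 20), start = 1961)

# Hypothetical helper: equal weights for a window of half-width a
ma_weights <- function(a) rep(1 / (2 * a + 1), 2 * a + 1)

trend2  <- stats::filter(sim, ma_weights(2),  sides = 2)   # 5-point average
trend10 <- stats::filter(sim, ma_weights(10), sides = 2)   # 21-point average

plot(sim, ylab = "Simulated production",
     main = "Moving averages, a = 2 and a = 10")
lines(trend2,  col = "red")
lines(trend10, col = "green")
legend("topleft", legend = c("a = 2", "a = 10"),
       col = c("red", "green"), lty = 1)

# The wider window loses more points at each end of the series
sum(is.na(trend2))    # 4
sum(is.na(trend10))   # 20
```

The trade-off is visible in the NA counts: a wider window gives a smoother line but leaves a longer gap at both ends of the series.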



Time Series Decomposition


Now that we have covered the trend component, we turn to the next component, the seasonal one. The seasonal component attempts to capture the variation within a time frame, usually a year, so that the fluctuations between seasons or quarters are captured. However, for our wheat data, we only have yearly measures in the time series, so seasonal decomposition will not be possible. There are no data points within each year to determine seasonal variations.
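As a quick check of this point, the sketch below builds a short yearly series from made-up numbers and asks stl() to decompose it. With one observation per year the frequency is 1, and stl() stops with an error instead of returning a seasonal component.

```r
# Made-up yearly series: one observation per year, so frequency is 1
annual <- ts(c(5, 7, 6, 9, 8, 11, 10, 13, 12, 15), start = 1961)
frequency(annual)   # 1

# stl() needs at least two full periods with frequency >= 2;
# here we capture the error message rather than halting the script
msg <- tryCatch({
  stl(annual, s.window = "periodic")
  "decomposition succeeded"
}, error = function(e) conditionMessage(e))

print(msg)
```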


But do not worry! In the future, we will cover additional time series data sets which do have multiple measures through each year, enabling us to perform the seasonal decomposition. The posts will be linked back to here.


We can model the time series further by using regression.




Regression


We covered linear regression with chemical predictors in a previous post. With our wheat time series, we can use the years as the predictor to determine coefficients to fit a trend for the log of wheat production. The equation is modeled as follows, where t is the years variable:

log(wheat) = a0 + a1*t + a2*t^2 + e


log Wheat Regression
Evaluating the model summary, we see that both t and the squared term, t^2, are significant in predicting the wheat production time series. Next we plot the fitted regression line over the wheat time series to observe the fit.
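For readers who want to try the fit themselves, here is a minimal sketch of the quadratic regression with lm(). The series log_sim is simulated and only stands in for the log of wheat production, and the column names t and t2 are my assumption about how the two terms might be stored.

```r
# Simulated stand-in for the log of yearly production (NOT the FAO data)
set.seed(1)
yr <- 1961:2010
u  <- yr - 1961
log_sim <- 5.4 + 0.04 * u - 0.0004 * u^2 + rnorm(length(yr), sd = 0.03)

# Year and squared year as the two predictor terms
dat <- data.frame(t = yr, t2 = yr^2, log_sim = log_sim)

# log(series) = a0 + a1*t + a2*t^2 + e
fit <- lm(log_sim ~ t + t2, data = dat)

summary(fit)$coefficients   # estimates, standard errors, t and p values
summary(fit)$r.squared      # share of variance captured by the model

# Fitted curve over the simulated series
plot(ts(dat$log_sim, start = 1961), ylab = "Simulated log production")
lines(ts(fitted(fit), start = 1961), col = "red")
```

A negative estimate for the t2 coefficient is what makes the fitted parabola open downward.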


Plotting Wheat Regression
Look how well the red regression line fits wheat production. Not too shabby, and the R squared value of 0.9798 confirms this: nearly 98% of the variance in the log of wheat production is captured by the regression model.



However, because of the t^2 term, the regression model is a parabola. The fitted curve bends downward toward the end of the series, so extending it suggests that the log of wheat production will eventually decrease. Let us plot the prediction forecast for the log of wheat production to year 2020 based on our model.


Plotting log Wheat Production Prediction
We create a new data.frame with the two t terms from 1961 all the way to 2020 for the prediction data points. Then, using the predict() function, we obtain the predicted values from the regression model for the log of wheat production from 1961 to 2020 in wheat2.lmp. Next we transform wheat2.lmp into a time series and plot it with the log of wheat production time series.
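The steps just described can be sketched roughly as follows. Again the data are simulated, and names such as newdat and pred_ts are placeholders of mine rather than the ones used in the original code.

```r
# Simulated stand-in for the log of yearly production (NOT the FAO data)
set.seed(1)
yr <- 1961:2010
u  <- yr - 1961
log_sim <- 5.4 + 0.04 * u - 0.0004 * u^2 + rnorm(length(yr), sd = 0.03)
dat <- data.frame(t = yr, t2 = yr^2, log_sim = log_sim)
fit <- lm(log_sim ~ t + t2, data = dat)

# New data.frame with both t terms, extended to 2020
newdat <- data.frame(t = 1961:2020, t2 = (1961:2020)^2)

# Predicted values from the model, turned into a yearly time series
pred    <- predict(fit, newdata = newdat)
pred_ts <- ts(pred, start = 1961)

# Observed (simulated) series with the prediction drawn over it
plot(ts(dat$log_sim, start = 1961), xlim = c(1961, 2020),
     ylim = range(dat$log_sim, pred), ylab = "Simulated log production")
lines(pred_ts, col = "red")

# Year at which the fitted parabola peaks
time(pred_ts)[which.max(pred_ts)]
```

The column names in newdat must match the predictor names used in the model formula, otherwise predict() cannot find them.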


And yes, the prediction does curve down! From the available data and quadratic regression model, the log of wheat production was modeled to be parabola-like. Thinking logically, wheat production should not decrease in the near future due to rising demand from the increasing world population. The regression model we have fits the wheat data we used to generate the model. 

In this case, it is difficult to extend the forecast to future years using stl() because there is no seasonality in our data points (no seasonal decomposition here). By only using yearly points, we can only predict the trend. Here is a recent post with seasonal decomposition of Amazon stock prices.

In the future, take note of the time resolution of your time series data. If it lacks seasonal data points, it will be more difficult to forecast accurately by only relying on the overall trend.


One More Regression Model


Let us consider one more regression model based on wheat production itself, rather than its log. The log transformation compresses large values: the higher values of the log of wheat production correspond to much larger differences on the original production scale. So it is possible that the higher values in log production sat closer together, resulting in a strongly parabolic model.

Using the regular wheat production numbers might alleviate this issue. Below we fit the regression model with the same two terms, and plot the log-model prediction and the observed wheat production alongside the predicted wheat production time series.


wheat[,2] = a0 + a1*t + a2*t^2 + e


Wheat Production Regression Prediction and Plotting
By plotting both the log wheat predicted production and wheat predicted production, we can see which models the time series better into the future.
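Here is a rough sketch of such a comparison on simulated data. Back-transforming the log-model prediction with exp() so that both curves share the production scale is my assumption about how the two were put on one plot, and all names are placeholders.

```r
# Simulated stand-in series (NOT the FAO data)
set.seed(1)
yr <- 1961:2010
u  <- yr - 1961
log_sim <- 5.4 + 0.04 * u - 0.0004 * u^2 + rnorm(length(yr), sd = 0.03)
dat <- data.frame(t = yr, t2 = yr^2, log_sim = log_sim, sim = exp(log_sim))

# Same two terms, fitted on the log scale and on the original scale
fit_log <- lm(log_sim ~ t + t2, data = dat)
fit_raw <- lm(sim ~ t + t2, data = dat)

newdat <- data.frame(t = 1961:2020, t2 = (1961:2020)^2)

# Assumption: exp() puts the log-model prediction back on the raw scale
pred_log <- ts(exp(predict(fit_log, newdata = newdat)), start = 1961)
pred_raw <- ts(predict(fit_raw, newdata = newdat), start = 1961)

plot(ts(dat$sim, start = 1961), xlim = c(1961, 2020),
     ylim = range(dat$sim, pred_log, pred_raw),
     ylab = "Simulated production")
lines(pred_raw, col = "red")    # model fitted to the raw series
lines(pred_log, col = "blue")   # log model, back-transformed
legend("topleft", legend = c("raw model", "log model"),
       col = c("red", "blue"), lty = 1)
```

How the two curves behave past the last observed year depends on the data, so the simulated picture should not be read as a statement about the wheat series.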


We see the regular regression as the red line, extending outwards but not quite decreasing (yet) by 2020. This is more reasonable than the prediction from the log of wheat production in blue, where the predicted production drops after 2007. However, this is far from a finished model. As production data from subsequent years become available, those points can be added to the time series, enabling us to predict near-future production better than we can without the most recent data.

Around 2020 is where this model begins to break down (the log model begins to break down around 2007), so as further data points become available, we can extend the forecast in step with the number of yearly data points we add.

Stay tuned for more time series in R!



Thanks for reading,

Wayne
@beyondvalence
