Valence Analytics: wheat

Showing posts with label wheat. Show all posts

Wednesday, March 12, 2014

R: World Wheat Production Part II, Linear Filtering and Regression Forecasting

Hello Readers,

Today we will continue analyzing the wheat production and harvest area data from the Food and Agriculture Organization of the UN. In the previous post, we imported the data from Data Market and proceeded to plot the production quantity and harvest area time series, reproduced below:

With the wheat production increasing more than the harvest area over the years, we also plotted histograms to track the lag of the log of wheat production. We noted that the mean difference were, as expected, above zero.

Next we move on to model the time series single component of trend in the wheat production in R. We will not focus on the Harvest Area series because it does not fluctuate like the Production Quantity series, as seen above. Let us get started.

Linear Filtering

Now that we have a good grasp on the wheat data, we can model the time series to find a trend. A time series is usually decomposed into multiple components: a trend, a seasonal component, and a remainder component. One method to accomplish trend discovery is by using linear filtering with moving averages and equal weights. Each element of the time series X is filtered through average a, such that X-a to Xa are multiplied by coefficients (1/(2a+1)). For example, with a filter of a=2, for an element Xn, sum of the previous two and next 2 elements and Xn, are divided by 1/5, as 1/(2*2+1).

In R, we model the trend component with the filter() function, and specify the coefficients. This allows us to plot the trend with different moving averages of 2 and 10 over the wheat production quantity time series.

Plotting Wheat Production with Moving Averages

Below, we can observe the moving averages of 2 and 10 overlaying the Production Quantity time series.

Note that the red moving average 2 trend line more closely follows the fluctuations of the time series than the green average 10. Why? Because the moving average = 10 takes the average of the last and next 10 for that series element.

So the green trend line incorporates more values in calculating the moving average (a=10, n is 21; a=2, n is 5) than the red trend line. Therefore green trend line is smoother and depicts a trend covering a longer time frame than the red trend line, which shows the average over a shorter time frame, thus fluctuating more with the wheat production data.

Time Series Decomposition

Now that we have covered the trend component, we turn to the next component, seasonal. The seasonal component attempts to capture the variation within a time frame, usually a year- so that the fluctuations between the seasons or quarters, are captured. However, for our wheat data, we only have yearly measures in the time series so seasonal decomposition will not be possible. There are no data points within each year to determine seasonal variations.

But do not worry! In the future, we will cover additional time series data sets which do have multiple measures through each year, enabling us to perform the seasonal decomposition. The posts will be linked back to here.

We can model the time series further by using regression.

Regression

We covered linear regression with chemical predictors in this post. With our wheat time series, we can use the years as the predictor to determine coefficients to fit a trend for the log of wheat production. The equation is modeled as follows where t is the years variable:

log(wheat) = a0 + a1*t + a2*t^2 + e

log Wheat Regression

Evaluating the model summary, we see that both t and the squared term, t^2, are both significant in predicting the wheat production time series. Next we plot the fitted regression line over the wheat time series to observe the fit.

Plotting Wheat Regression

Look how well the red regression line fits wheat production. Not to shabby, and the R squared value confirms this, at 0.9798, which means nearly 98% of the variance is captured by the regression model.

However, the regression model is based on a parabola equation, with the t^2 term. So it appears that the continuation of the log of wheat production in fact decreases according to the regression model. Let us plot the prediction forecast for the log of wheat production to year 2020 based on our model.

Plotting log Wheat Production Prediction

We create a new data.frame with the two t terms from 1961 all the way to 2020 for the prediction data points. Then using the predict() function, we obtained the predicted values from the regression model for log of wheat production from 1961 to 2020 in wheat2.lmp. Next we transformed the wheat2.lmp into a time series and we plotted it with the log of wheat production time series:

And yes, the prediction does curve down! From the available data and quadratic regression model, the log of wheat production was modeled to be parabola-like. Thinking logically, wheat production should not decrease in the near future due to rising demand from the increasing world population. The regression model we have fits the wheat data we used to generate the model.

In this case, it is difficult to forecast the model to future years using stl() because there is no seasonality in our data points (no seasonal decomposition here). By only using yearly points, we can only predict the trend. Here is a recent post with seasonal decomposition of Amazon stock prices.

In the future, take note of the time resolution of your time series data. If it lacks seasonal data points, it will be more difficult to forecast accurately by only relying on the overall trend.

One More Regression Model

Let us consider one more regression model based on the wheat production, rather than the log of wheat production. The larger values in log of wheat production actually correspond to higher values in wheat production, and the scale of the differences are much larger. So there is a possibility that the higher values in log production were closer together, resulting in a highly parabolic model.

Using the regular wheat production numbers might alleviate this issue. Below we model the linear regression model with two terms, and plot the log prediction and the wheat production over the predicted wheat production time series.

wheat[,2] = a0 + a1*t + a2*t^2 + e

Wheat Production Regression Prediction and Plotting

By plotting both the log wheat predicted production and wheat predicted production, we can see which models the time series better into the future.

We see the regular regression through the red line, extending outwards but not quite decreasing (yet) by 2020. This is more reasonable than the prediction from the log of wheat production in blue, where after 2007, the predicted production drops. However, this is far from the finished model, and as more production data from subsequent years are obtained, those points can be added to the time series and enable us to predict the production in the near future better than without most recent data.

Through 2020 is where the model begins to break down (the log model begins at 2007), so with further data points, we can extend the forecast corresponding to the number of yearly data points we add.

Stay tuned for more time series in R!

Thanks for reading,

Wayne
@beyondvalence

Sunday, March 9, 2014

R Time Series: World Wheat Production and Harvest Area, Part I

Hello Readers,

Welcome back folks. Today we will revisit time series through the lens of World Wheat Production Quantity and Harvest Area. I recently discovered a data repository called Data Market, which has data sets by industry, country, and other various topics and sources, including the United Nations.

This particular data set on wheat was provided by the Food and Agriculture Organization (FAO) of the United Nations. This UN organization helps developing and developed countries "improve agriculture, forestry, and fishing practices, and ensuring good nutrition and food security." They are headquartered in Rome, Italy.

*Tasty Wheat*

The world produced 651 million tons of wheat in 2010. The cereal grain production was behind maize (844 million tons), and rice (672 million tons). Wheat is the leading source of vegetable protein in human food, and also the trade for wheat surpasses all other crops in the world combined.

So wheat is an important crop, and many of us eat some form of that cereal grain everyday, whether it be as bread or cereal product. So let us start number munching!

Getting Wheat from (Data) Market to R

Many data sets exist on Data Market, and with a quick search, we can find our wheat production data. I found the wheat data sifting through in the Food and Agriculture Industry under the Crops header, shown below.

Food and Agriculture Industry

Here it is, and after scrolling down and clicking explore, we arrive at the wheat data page. Make sure to select "Area Harvested" in addition to the "Production Quantity" under "Element/Unit". Multiple formats are available for download under the Export tab. I used the CSV format separated via commas. Choose a file format with which you are familiar.

Wheat Harvest Area and Production Quantity

Now that we have the wheat CSV data, we can import it into R and begin the analysis.

Wheat Data

We saw what the data looked like online, and we would like to reproduce it in R. Let use plot the wheat time series data.

First read in the CSV file, and now we have 3 columns with 48 rows of data. The first column in wheat denotes years, the second- Area Harvested in hectares (Ha), and the third- Production Quantity in metric tonnes. Note the last row. It is a comment-NA and we need to remove it. Also, let us specify the beginning of the time series at year 1961 with 1 measurement per year.

We can do all this by selecting the elements of wheat to convert into a times series object. The first column is not required and the 48th row needs to be excluded, which gives us wheat[-48, 2:3]. The time units are described as starting in 1961, with 1 measurement per unit time.

Reading in Wheat CSV

Now we can plot the wheat data with plot(). First plot the Production Quantity in the second column of wheat, then add the Harvest Area in the first column with lines(). The default x axis tick marks only print by decade so I specified a custom x-axis with axis() and the pretty() function for aesthetics. It is a useful function for axis customization, along with legend() to clarify the plot lines.

Plotting Wheat

And below we have the world wheat production in tonnes (metric) and harvest area (Ha) graphic. Observe a general trend upwards in wheat production over 47 years, with a relatively stable harvest area. It appears that the amount of wheat produced per area has increased dramatically over the years.

Wheat production in 1961 measured about 222 million metric tonnes and in 2007, it had increased to over 600 million metric tonnes. Surprisingly, the Harvest Area remained relatively stable from 1961 at 204 million hectares to over 214 million hectares in 2007.

Perhaps this relationship would be better visualized in a plot of the ratio of the two series. Take the the production series and divide it by the harvest area to obtain the ratio.

Wheat Ratio Plot Code

Observe the ratio close to 1.09 at 1961 and rising to around 2.82 in 2007. This is a 260% rise in wheat production efficiency per hectare over the 47 years.

Those figures are obtained below by dividing the ratio from 2007 by the ratio from 1961:

Wheat Production per Hectare

Time Series Distributions of Wheat

Before we dive into any time series decomposition, let us look at the distribution of wheat Production Quantity and Harvest Area first. Since the tonnes of the wheat produced and hectares used number in the hundred millions, we will make the production quantity and harvest area more manageable by passing it through the log(). Then we plot the differences between each measurement with diff() to see how the time series elements differ from each other.

Plotting Wheat Distributions

The abline(0, 0) simply adds a line y=0 to designate no difference between points as a reference.

Most of the distribution for Production Quantity (red) lies above y=0, indicating that the differences between each element are mostly positive, and the over all trend of the time series is increasing more than it is decreasing. On the other hand, Harvest Area differences do not appear to lean towards either direction, and is more stable than Production Quantity. These observations are confirmed when looking back at the Wheat Time Series Plot.

Distribution Histograms

Another way to visualize the distributions is with a histogram. First, we plot production quantity with hist() and specify that it plot the probability density. Next we overlay the probability density line to show how distribution of probabilities on the histogram. Then we assume the differences follow a normal distribution so we calculate the mean and the standard deviation from the differences of the log of production and plot the parametric normal distribution over the histogram as well.

Histogram for Wheat Production Differences

Below we can see the rendered histogram for the differences of the log of wheat production with the density estimate and the normal distribution with production quantity parameters. According to the normal distribution, the mean lies above 0.0, indicating increasing successive production quantity values.

We can observe the more stable wheat harvest area distribution with the same code, only modified for column 1 of wheat, and different colors in the histogram.

Plotting Wheat Harvest Differences

Here we have a more balanced histogram of harvest area distribution, with both the estimated density and parametric normal distribution means above 0.0. This, again, shows that the wheat harvest area stayed relatively more stable over time than the wheat production quantity.

After analyzing the distributions of both wheat production quantity and harvest area, we have a better grasp of the wheat time series data. Now we can move on to time series decomposition- in the next post! (Had to cut this post here; it was getting too lengthy.)

So stay tuned, and stay curious!

Thanks for reading,

Wayne
@beyondvalence

Pages