
Sunday, March 9, 2014

R Time Series: World Wheat Production and Harvest Area, Part I


Hello Readers,


Welcome back, folks. Today we will revisit time series through the lens of World Wheat Production Quantity and Harvest Area. I recently discovered a data repository called Data Market, which hosts data sets organized by industry, country, and various other topics and sources, including the United Nations.

This particular data set on wheat was provided by the Food and Agriculture Organization (FAO) of the United Nations. This UN organization helps developing and developed countries "improve agriculture, forestry, and fishing practices" and ensure good nutrition and food security. It is headquartered in Rome, Italy.


*Tasty Wheat*
The world produced 651 million tons of wheat in 2010, behind only maize (844 million tons) and rice (672 million tons) among cereal grains. Wheat is the leading source of vegetable protein in human food, and world trade in wheat exceeds that of all other crops combined.

So wheat is an important crop, and many of us eat some form of that cereal grain every day, whether as bread or a cereal product. So let us start number munching!



Getting Wheat from (Data) Market to R



Many data sets exist on Data Market, and with a quick search we can find our wheat production data. I found the wheat data by sifting through the Food and Agriculture industry under the Crops header, shown below.


Food and Agriculture Industry
Here it is, and after scrolling down and clicking Explore, we arrive at the wheat data page. Make sure to select "Area Harvested" in addition to "Production Quantity" under "Element/Unit". Multiple formats are available for download under the Export tab. I used the comma-separated CSV format. Choose a file format with which you are familiar.


Wheat Harvest Area and Production Quantity

Now that we have the wheat CSV data, we can import it into R and begin the analysis.



Wheat Data


We saw what the data looked like online, and we would like to reproduce it in R. Let us plot the wheat time series data.

First read in the CSV file, which gives us 48 rows of data in 3 columns. The first column of wheat denotes the year; the second, Area Harvested in hectares (ha); and the third, Production Quantity in metric tonnes. Note the last row: it is a comment row containing NA, and we need to remove it. Also, let us specify that the time series begins in 1961 with 1 measurement per year.

We can do all this by selecting the elements of wheat to convert into a time series object. The first column is not required and the 48th row needs to be excluded, which gives us wheat[-48, 2:3]. The time units are specified as starting in 1961, with 1 measurement per unit time.


Reading in Wheat CSV
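A minimal sketch of those steps; the file name wheat.csv and the object names here are my own assumptions:

    # Read the exported CSV (file name assumed) and inspect it
    wheat <- read.csv("wheat.csv", header=TRUE)
    tail(wheat)    # note the comment/NA row in position 48

    # Drop the year column and the 48th row, then convert to a
    # yearly time series starting in 1961
    wheat.ts <- ts(wheat[-48, 2:3], start=1961, frequency=1)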
Now we can plot the wheat data with plot(). First plot the Production Quantity in the second column of the time series object, then add the Harvest Area in the first column with lines(). The default x-axis tick marks only print by decade, so I specified a custom x-axis with axis() and the pretty() function for aesthetics. axis() is a useful function for axis customization, along with legend() to clarify the plot lines.


Plotting Wheat
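A sketch of that plotting code, assuming the hypothetical wheat.ts object from above; the colors are my own choices:

    # Plot production quantity (column 2 of the ts object) first,
    # leaving room on the y-axis for both series
    plot(wheat.ts[, 2], col="red", xaxt="n", xlab="Year",
         ylim=range(wheat.ts), ylab="Tonnes / Hectares",
         main="World Wheat Production and Harvest Area")
    lines(wheat.ts[, 1], col="blue")       # add harvest area
    axis(1, at=pretty(1961:2007, n=10))    # finer, nicely spaced ticks
    legend("topleft", c("Production Quantity (t)", "Harvest Area (ha)"),
           col=c("red", "blue"), lty=1)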
And below we have the world wheat production (metric tonnes) and harvest area (ha) graphic. Observe a general upward trend in wheat production over the 47 years, with a relatively stable harvest area. The amount of wheat produced per unit area appears to have increased dramatically over the years.



Wheat production measured about 222 million metric tonnes in 1961 and had increased to over 600 million metric tonnes by 2007. Surprisingly, the harvest area remained relatively stable, going from 204 million hectares in 1961 to just over 214 million hectares in 2007.


Perhaps this relationship would be better visualized in a plot of the ratio of the two series. Take the production series and divide it by the harvest area series to obtain the ratio.


Wheat Ratio Plot Code
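A sketch of the ratio code, continuing with the hypothetical wheat.ts object:

    # Tonnes produced per hectare harvested
    wheat.ratio <- wheat.ts[, 2] / wheat.ts[, 1]
    plot(wheat.ratio, ylab="Tonnes per Hectare",
         main="World Wheat Production per Harvest Area")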

Observe the ratio close to 1.09 in 1961, rising to around 2.82 in 2007. Wheat production per hectare in 2007 was thus about 2.6 times the 1961 figure, a rise of roughly 160% in efficiency over the 47 years.


Those figures are obtained below by dividing the ratio from 2007 by the ratio from 1961:


Wheat Production per Hectare
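A sketch of that division, using the wheat.ratio series defined above:

    wheat.ratio[1]                     # 1961: about 1.09 t/ha
    wheat.ratio[47]                    # 2007: about 2.82 t/ha
    wheat.ratio[47] / wheat.ratio[1]   # about 2.6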



Time Series Distributions of Wheat


Before we dive into any time series decomposition, let us look at the distributions of the wheat Production Quantity and Harvest Area. Since the tonnes of wheat produced and hectares harvested number in the hundreds of millions, we will make the values more manageable by passing them through log(). Then we plot the differences between successive measurements with diff() to see how the time series elements change from year to year.


Plotting Wheat Distributions
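A sketch of that code, with object names of my own:

    # Differences of the logged series
    diff.prod <- diff(log(wheat.ts[, 2]))    # production quantity
    diff.area <- diff(log(wheat.ts[, 1]))    # harvest area

    plot(diff.prod, col="red", ylim=range(diff.prod, diff.area),
         ylab="Differenced log values",
         main="Differences of Logged Wheat Series")
    lines(diff.area, col="blue")
    abline(0, 0)    # reference line at y = 0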

The abline(0, 0) simply adds the line y = 0 as a reference, marking no difference between successive points.



Most of the distribution for Production Quantity (red) lies above y = 0, indicating that the differences between successive elements are mostly positive, and that the overall trend of the time series is increasing more than it is decreasing. On the other hand, the Harvest Area differences do not lean in either direction, and the series is more stable than Production Quantity. These observations are confirmed by looking back at the wheat time series plot.




Distribution Histograms


Another way to visualize the distributions is with a histogram. First, we plot the production quantity differences with hist() and specify that it plot the probability density. Next, we overlay the estimated density line to show how the probabilities are distributed across the histogram. Then, assuming the differences follow a normal distribution, we calculate the mean and standard deviation of the differences of the log of production and plot that parametric normal distribution over the histogram as well.


Histogram for Wheat Production Differences
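A sketch of the histogram code, reusing the hypothetical diff.prod series:

    # Histogram of differenced log production, on the density scale
    hist(diff.prod, prob=TRUE, col="red", xlab="Yearly difference",
         main="Differences of log(Production Quantity)")
    lines(density(diff.prod))    # estimated probability density
    # Parametric normal overlay using the sample mean and sd
    m <- mean(diff.prod); s <- sd(diff.prod)
    curve(dnorm(x, mean=m, sd=s), add=TRUE, lty=2)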
Below we can see the rendered histogram of the differences of the log of wheat production, with the density estimate and the normal distribution fitted with the production quantity parameters. The mean of the fitted normal lies above 0.0, indicating that successive production quantity values tend to increase.



We can observe the more stable wheat harvest area distribution with the same code, only modified for column 1 of wheat and with different colors in the histogram.


Plotting Wheat Harvest Differences
Here we have a more balanced histogram for the harvest area differences, with the means of both the estimated density and the parametric normal distribution above 0.0. This again shows that the wheat harvest area stayed relatively more stable over time than the wheat production quantity.



After analyzing the distributions of both wheat production quantity and harvest area, we have a better grasp of the wheat time series data. Now we can move on to time series decomposition in the next post! (I had to cut this post here; it was getting too lengthy.)

So stay tuned, and stay curious!


Thanks for reading,


Wayne
@beyondvalence

Friday, August 30, 2013

R1.2: Probability Distributions - Calculations for Statistical Distributions in R

In this post we will explore the different calculations we can apply to statistical distributions in R. We will cover:

  1. Density and Point Probability
  2. Cumulative Probability
  3. Quantiles
  4. Pseudo-random Numbers

1. Density and Point Probability


When we have a continuous distribution, such as a normal distribution (Gaussian), the density is a measure of the relative probability of getting a value close to x. So for a particular interval in the distribution, the probability of getting a value in the interval is the area under the curve. 

For example, we can use the familiar 'bell-curve' otherwise known as a normal distribution. Through the console in R, we will create a plot showing the density of a normal distribution:

Fig. 1: Creating a Density Plot for a Normal Distribution
For the x values, I used the seq function to generate values from -4 to 4 in increments of 0.1. The y values are obtained with the dnorm function, which returns the density of a standard normal distribution at those x values. To make the plot visually pleasing, the plot type is changed to "l" for lines, and we get the curve shown in Figure 2.
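A minimal sketch of that code:

    x <- seq(-4, 4, 0.1)          # x values from -4 to 4 in steps of 0.1
    plot(x, dnorm(x), type="l")   # standard normal density as a line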


Fig. 2: Density plot of a Standard Normal Distribution
Shifting to discrete distributions, such as the binomial distribution: here the variables can only take distinct values, such as the number of iPads sold. iPad counts are discrete because you cannot sell half an iPad, only whole quantities: one, twelve, one hundred, and so on. This differs from continuous variables, which can take any value within a specified range, such as temperature or mass. The binomial distribution is a special case of a discrete distribution where each outcome is one of two possibilities: yes/no, or success/failure. For this binomial distribution, we will suppose the number of independent trials is n = 50 and the probability of 'success' is 0.33.

Fig. 3: R Code for Plotting a Binomial Distribution
As with the normal distribution, the y values for plotting the binomial distribution are given by dbinom(x, size=50, prob=0.33). The size corresponds to the n of 50, and prob represents the probability of success, 0.33.
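A sketch of that code; type="h" draws the vertical, histogram-like spikes seen in Figure 4:

    x <- 0:50                                          # possible success counts
    plot(x, dbinom(x, size=50, prob=0.33), type="h")   # point probabilities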


Fig. 4: Plot of a Binomial Distribution, Histogram Type
Above in Figure 4, we see the plot of the binomial distribution with n = 50 and a probability of 0.33. The point probabilities peak around 16 and 17 successes, where outcomes have the highest chance of occurring, and drop off as we move away from 16 and 17 in either direction. This makes sense: the expected number of successes is 0.33 * 50 = 16.5, and since this discrete distribution only allows whole numbers of successes, the probabilities must be highest at 16 and 17.



2. Cumulative Probability


Cumulative probability distributions express the probability of 'hitting' a specified x or less in a given distribution. I will not show plots of cumulative probability distributions, because most of the time actual numbers are desired. Suppose we have a non-standard normal distribution of blood pressures with a mean of 132 and a standard deviation of 13, and we encounter a patient with a blood pressure of 160 (systolic, hopefully). Our x in this situation is 160, so what is the probability of patients having a blood pressure of 160 or less?


Fig. 5: Cumulative Probability of a Normal Distribution with u=132 and sd=13
The pnorm function gives us the cumulative probability in a normal distribution. From Figure 5, we see that 98.44% of patients have a blood pressure of 160 or lower. If we subtract that from 1, we see that 1.56% of patients have a blood pressure higher than 160.
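Those numbers come from pnorm calls along these lines:

    pnorm(160, mean=132, sd=13)       # 0.9844: P(BP <= 160)
    1 - pnorm(160, mean=132, sd=13)   # 0.0156: P(BP > 160)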

For discrete distributions, the cumulative probability works similarly, using the pbinom function. Suppose we have 20 people choosing between Coke (A) and Pepsi (B), and say the preference is equal (p = 0.5) because it is a blind test. In the test, 16 people chose option A, Coke. How likely is it that 16 or fewer people would choose Coke over Pepsi?


Fig. 6: Cumulative Probability of a Binomial Distribution with n=20 and prob=0.5
Using the pbinom function, with size being the number of people and prob the probability of the first option, we see a 99.87% probability that 16 or fewer people would choose Coke. What about the probability of 16 or more? For that we evaluate pbinom at 15 instead of 16, because we want 16 to be included when we subtract the cumulative probability from 1. So the chance of 16 or more people choosing Coke over Pepsi is 0.59%.
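A sketch of those pbinom calls:

    pbinom(16, size=20, prob=0.5)       # 0.9987: P(X <= 16)
    1 - pbinom(15, size=20, prob=0.5)   # 0.0059: P(X >= 16)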

But what if we did not know beforehand which drink was preferred? Then we need to add the probability of equally extreme results in the other direction, that is, the chance of 4 or fewer people choosing option A (Coke) over B (Pepsi).


Fig. 7: Cumulative Two-Tailed Probability of a Binomial Distribution, n=20, prob=0.5
In addition to the probability of 16 or more from Figure 6, we include the probability of 4 or fewer, as shown in Figure 7. The two-tailed probability of 16 out of 20 people choosing option A with a probability of 0.5 is 1.18%, which is twice the one-tailed probability.
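A sketch of the two-tailed calculation:

    # P(X <= 4) + P(X >= 16); symmetric about 10 when prob = 0.5
    pbinom(4, size=20, prob=0.5) + (1 - pbinom(15, size=20, prob=0.5))
    # 0.0118, twice the one-tailed probability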




3. Quantiles

Quantiles are the inverse of the cumulative distribution function: the p-quantile is the value at or below which a proportion p of the distribution lies. The median, also known as the 50% quantile, is the value below which half of the distribution falls. In the back of statistics books you will most likely find tables of fixed probabilities showing the boundary a test statistic must pass to be significant at a given level. Common levels are 90%, 95%, and 99%, with the 95% level used most often.


Fig. 8: Formula for the 95% Confidence Interval of a Normal Distribution
Here in Figure 8, we have the formula for a 95% confidence interval (CI): xbar ± 1.96 * sigma / sqrt(n). Given the vitals of a normal distribution, xbar (the sample mean), sigma (the standard deviation), and n (the sample size), we can derive the 95% CI for mu, the true mean. Suppose xbar = 83, sigma = 12, and n = 5 (small, but okay). The standard error of the mean (sem) is the factor that scales the quantile coefficient (Figure 10) in the formula (Figure 8).

Fig. 9: Code for Standard Error of the Mean and 95% CI
In the code above, we see that the sem (5.367) is calculated by dividing the standard deviation by the square root of the sample size. The sem is then multiplied by the 0.025 and 0.975 quantiles (so there is 2.5% in each tail) and added to xbar to get the confidence interval about the true mean. With the lower and upper bounds of this 95% confidence interval, we can be reasonably confident that the range from 72.48 to 93.52 covers the true mean.
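A sketch of that calculation:

    xbar <- 83; sigma <- 12; n <- 5
    sem <- sigma / sqrt(n)      # 5.367, the standard error of the mean
    xbar + sem * qnorm(0.025)   # lower bound, about 72.48
    xbar + sem * qnorm(0.975)   # upper bound, about 93.52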

Fig. 10: qnorm Functions and Quantile Values for Standard Normal Distribution
On a side note, the qnorm(k) function gives us the k-quantile of a standard normal distribution: the value at or below which a proportion k of the distribution lies. You will often see the 2.5% and 97.5% quantiles in statistics, usually abbreviated as -1.96 and 1.96, respectively. Between the 2.5% and 97.5% quantiles lies 95% of the values, hence the 95% confidence interval.
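For instance:

    qnorm(0.025)   # -1.959964
    qnorm(0.975)   #  1.959964
    qnorm(0.5)     #  0, the median of the standard normal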

Here is some extra code for some fun:

Fig. 11: Z Values, and 95% CI Calculation
What we have here in Figure 11 is the deconstruction of the 95% CI bounds back into their Z scores, the standard 95% quantiles at -1.96 and 1.96. As you can see, the difference between each bound and xbar is multiplied by sqrt(n) divided by the standard deviation (the reciprocal of sigma/sqrt(n)). We then concatenate the two 95% quantiles into one vector and multiply it by the standard error of the mean, producing the lower and upper bounds together.
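A sketch of that deconstruction, continuing with xbar, sigma, n, and sem from above:

    # Recover the Z scores from the CI bounds
    (72.48 - xbar) * sqrt(n) / sigma      # about -1.96
    (93.52 - xbar) * sqrt(n) / sigma      # about  1.96
    # Both bounds at once, via a concatenated quantile vector
    xbar + sem * qnorm(c(0.025, 0.975))   # 72.48  93.52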


4. Pseudo-random Numbers


Random numbers drawn in R are not truly random at all; they are determined by seed numbers, which is why they are called pseudo-random. For practical purposes, however, these generated numbers behave as random, so they can be used to simulate sets of 'randomly' generated numbers.


Fig. 12: Randomly Generated Sample Distributions
In the sample distributions shown in Figure 12, the first two are sets of 10 randomly generated numbers from a normal distribution. The two sets differ because the seed advances with each call unless it is explicitly reset, so successive draws will not (likely) be the same. Even when we draw repeatedly from the same normal distribution, we do not get the same values, which is exactly what we want in practice.

Secondly, we can adjust the arguments of the random distribution functions, such as setting the mean to 7 and the standard deviation to 5 in rnorm, or drawing 10 values from a binomial distribution with a size of 20 and a probability of success of 0.5.
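A sketch of such calls (the seed value 1 is arbitrary):

    rnorm(10)    # two calls, two different sets of draws
    rnorm(10)
    set.seed(1)  # resetting the seed makes draws reproducible
    rnorm(10, mean=7, sd=5)         # adjusted mean and sd
    rbinom(10, size=20, prob=0.5)   # ten binomial counts of 20 trials each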


There are many more types of distributions and statistical tests available in R, which build on the fundamental calculations on distributions in this post. We will explore these in later posts.

Thanks for reading,

Wayne