Valence Analytics: probability distribution

Showing posts with label probability distribution. Show all posts

Sunday, March 9, 2014

R Time Series: World Wheat Production and Harvest Area, Part I

Hello Readers,

Welcome back folks. Today we will revisit time series through the lens of World Wheat Production Quantity and Harvest Area. I recently discovered a data repository called Data Market, which has data sets by industry, country, and other various topics and sources, including the United Nations.

This particular data set on wheat was provided by the Food and Agriculture Organization (FAO) of the United Nations. This UN organization helps developing and developed countries "improve agriculture, forestry, and fishing practices, and ensuring good nutrition and food security." They are headquartered in Rome, Italy.

*Tasty Wheat*

The world produced 651 million tons of wheat in 2010. The cereal grain production was behind maize (844 million tons), and rice (672 million tons). Wheat is the leading source of vegetable protein in human food, and also the trade for wheat surpasses all other crops in the world combined.

So wheat is an important crop, and many of us eat some form of that cereal grain everyday, whether it be as bread or cereal product. So let us start number munching!

Getting Wheat from (Data) Market to R

Many data sets exist on Data Market, and with a quick search, we can find our wheat production data. I found the wheat data sifting through in the Food and Agriculture Industry under the Crops header, shown below.

Food and Agriculture Industry

Here it is, and after scrolling down and clicking explore, we arrive at the wheat data page. Make sure to select "Area Harvested" in addition to the "Production Quantity" under "Element/Unit". Multiple formats are available for download under the Export tab. I used the CSV format separated via commas. Choose a file format with which you are familiar.

Wheat Harvest Area and Production Quantity

Now that we have the wheat CSV data, we can import it into R and begin the analysis.

Wheat Data

We saw what the data looked like online, and we would like to reproduce it in R. Let use plot the wheat time series data.

First read in the CSV file, and now we have 3 columns with 48 rows of data. The first column in wheat denotes years, the second- Area Harvested in hectares (Ha), and the third- Production Quantity in metric tonnes. Note the last row. It is a comment-NA and we need to remove it. Also, let us specify the beginning of the time series at year 1961 with 1 measurement per year.

We can do all this by selecting the elements of wheat to convert into a times series object. The first column is not required and the 48th row needs to be excluded, which gives us wheat[-48, 2:3]. The time units are described as starting in 1961, with 1 measurement per unit time.

Reading in Wheat CSV

Now we can plot the wheat data with plot(). First plot the Production Quantity in the second column of wheat, then add the Harvest Area in the first column with lines(). The default x axis tick marks only print by decade so I specified a custom x-axis with axis() and the pretty() function for aesthetics. It is a useful function for axis customization, along with legend() to clarify the plot lines.

Plotting Wheat

And below we have the world wheat production in tonnes (metric) and harvest area (Ha) graphic. Observe a general trend upwards in wheat production over 47 years, with a relatively stable harvest area. It appears that the amount of wheat produced per area has increased dramatically over the years.

Wheat production in 1961 measured about 222 million metric tonnes and in 2007, it had increased to over 600 million metric tonnes. Surprisingly, the Harvest Area remained relatively stable from 1961 at 204 million hectares to over 214 million hectares in 2007.

Perhaps this relationship would be better visualized in a plot of the ratio of the two series. Take the the production series and divide it by the harvest area to obtain the ratio.

Wheat Ratio Plot Code

Observe the ratio close to 1.09 at 1961 and rising to around 2.82 in 2007. This is a 260% rise in wheat production efficiency per hectare over the 47 years.

Those figures are obtained below by dividing the ratio from 2007 by the ratio from 1961:

Wheat Production per Hectare

Time Series Distributions of Wheat

Before we dive into any time series decomposition, let us look at the distribution of wheat Production Quantity and Harvest Area first. Since the tonnes of the wheat produced and hectares used number in the hundred millions, we will make the production quantity and harvest area more manageable by passing it through the log(). Then we plot the differences between each measurement with diff() to see how the time series elements differ from each other.

Plotting Wheat Distributions

The abline(0, 0) simply adds a line y=0 to designate no difference between points as a reference.

Most of the distribution for Production Quantity (red) lies above y=0, indicating that the differences between each element are mostly positive, and the over all trend of the time series is increasing more than it is decreasing. On the other hand, Harvest Area differences do not appear to lean towards either direction, and is more stable than Production Quantity. These observations are confirmed when looking back at the Wheat Time Series Plot.

Distribution Histograms

Another way to visualize the distributions is with a histogram. First, we plot production quantity with hist() and specify that it plot the probability density. Next we overlay the probability density line to show how distribution of probabilities on the histogram. Then we assume the differences follow a normal distribution so we calculate the mean and the standard deviation from the differences of the log of production and plot the parametric normal distribution over the histogram as well.

Histogram for Wheat Production Differences

Below we can see the rendered histogram for the differences of the log of wheat production with the density estimate and the normal distribution with production quantity parameters. According to the normal distribution, the mean lies above 0.0, indicating increasing successive production quantity values.

We can observe the more stable wheat harvest area distribution with the same code, only modified for column 1 of wheat, and different colors in the histogram.

Plotting Wheat Harvest Differences

Here we have a more balanced histogram of harvest area distribution, with both the estimated density and parametric normal distribution means above 0.0. This, again, shows that the wheat harvest area stayed relatively more stable over time than the wheat production quantity.

After analyzing the distributions of both wheat production quantity and harvest area, we have a better grasp of the wheat time series data. Now we can move on to time series decomposition- in the next post! (Had to cut this post here; it was getting too lengthy.)

So stay tuned, and stay curious!

Thanks for reading,

Wayne
@beyondvalence

Thursday, August 29, 2013

R1.1: Probability Distributions - Random Sampling, Combinatorics in R

In this post we will work with R to perform some statistical tests and sampling using probability distributions.

Not all data sets have exact distributions that match a specific probability density function. There is a certain degree of randomness and 'un-reproducibility' to it. For example, ask 100 people with iPads how many times they use their iPads to view their Twitter feeds, and we will get a set of seemingly random numbers. That is to say, the data is almost guaranteed not to fit a perfect statistical model, such as a normal distribution or a Poisson distribution. However, these theoretical statistical distributions can enable us to have a better understanding and view of the data we work with. Let us dive in!

1. Random Sampling

In R we can draw random samples from a specific distribution. A basic example would be picking numbers for the Powerball lottery. To get the jackpot, we would have to match all six numbers on our $2 ticket, with five of them ranging from 1 to 59, and the last red ball ranging from 1 to 35 (they are drawn from two separate sets of balls). Here is the drawing of the 'staggering' $217.2 million Powerball jackpot in February 2013:

The Powerball site gives the odds of winning the jackpot at dismal 1 in 175,223,510 (I do not like it either, but hey, a ticket is only two dollars). We will derive this number further into the post.

To simulate this in R, we can choose sample numbers. The code for the five white balls is followed by the code for the red powerball, with results shown.

Fig. 1: Powerball Number Sampling

The the sample function picks a specified number of samples (5 and 1) from a range of numbers (1 to 59 and 1 to 35). We can concatenate the two sets of numbers together with the concatenate function and assign the result to pbdrawing.

Fig. 2: All Six Numbers in a Sample Ticket

You might notice that the numbers 'drawn' the second time in Figure 2 were different than in Figure 1. This is because every time we use the sample function, we get a new set of sampled numbers. However, since we stored them into pbdrawing, the numbers in there will stay the same (unless we assign different numbers to it after).

The default for the sample function is not to replace the values selected. The Powerball numbers are not replaced in the actual drawing. We can tell R to replace by adding a third argument, replace=True. This would be applicable in coin tossing. Say we toss a coin 10 times, assuming equal probability. Just because we toss a heads does not mean we cannot obtain that result again. That is when we set replace to True.

Fig. 3: Results of Ten Coin Tosses

As you can see, we get 6 tails and 4 heads from 10 tosses. The result is not split down the middle, because it is random sampling. If we would like to split the probability unevenly, we can alter the probability of the events in the sample range we are picking from. Suppose we have an electronic prototype, say the next generation iPads, which has a 10% screen failure rate after a month of testing (ouch, glad it is still a prototype). We can take a random sample of 10 next-gen iPads and see how many screens failed at the end of the testing period.

Fig. 4: Successes and Failures

We represented the good iPads by "succ", short for success, and the iPads with bad screens with "fail". The additional code redefines the probability to be 0.9 for the first outcome (good screen!), and 0.1 (or 10%) for the second outcome of screen failure. So we have 9 iPads with good screens and 1 with a bad screen, which is about what we could expect in a random sample of 10 iPads.

2. Probability Calculations & Combinatorics

Let us go back to the Powerball example using combinatorics. To recap, Powerball drawings occur from drawing 5 numbers from balls labeled 1 through 59 and one red ball from balls labeled 1 through 35.

If we draw the first five white balls, how many different permutations would be possible? That is, how many different ways can we draw the 5 numbers out of the 59? Suppose we take into account the order of the numbers. Then we would calculate the permutation using factorials (where 4! is 4*3*2*1).

Fig. 5: Permutation of Picking 5 from 59 numbers

The number of ways we can pick from 59 white balls is 59, and after the first pick, we have 58 left (no replacement). And choosing from the 58 there are 58 different balls, etc. So we need 59*58*57*56*55. We can write it using factorials: 59!/54! works because the numbers 54 and below in the denominator cancel out the numbers 54 and below in the numerator. Therefore we are left with: the product of numbers 59 to 55, as prod(59:55)as shown in Figure 5. R tells us that there are 600,766,320 ways (in order) we can pick 5 numbers from 59 balls.

But in the Powerball lottery, it does not matter what order the numbers are picked (like in the YouTube video), you just need to match them. So order does not matter; and to calculate how many ways we can choose 5 numbers from 59 balls in any order, we need to understand combinations.

We start with the permutation we calculated previously. That large number (6 hundred million+), is the number of ways we can choose 5 numbers in unique order. If order does not matter, the number of ways will be less. So we can simply divide it with the number of ways the 5 chosen numbers can be ordered. The first lucky number has 5 locations, 1st, 2nd, 3rd, 4th, or 5th one chosen, and the second number has 4 locations, etc. Therefore, we take 5! for the number of balls we picked, and use it to divide the permutation.

Fig. 6: Combination of White Powerball Numbers

In Figure 6, we see the permutation, followed by the combination calculation, and then by the handy R function for combination, the choose function. We observe the same number of ways for the choose function and our manual calculation derived from permutation.

The mathematical formula for combination (also known as the binomial coefficient), where n is the number of elements and k is the number of picks:

Fig. 7: Number of k-Combinations in n-Elements

and is commonly read as "n choose k" (hence the name of the R function).

So when choose the final red ball from 35 balls, there is 35 different balls to choose from. We can apply the formula (however obvious the solution) to verify:

Fig. 8: Combinations of the Red Ball

Needless to say, the 34! cancels out all but the 35 in the 35! and 1! is simply 1, so the result is 35 different ways to choose a red ball from 35 of them. Now with different sets of balls, multiplying the number of ways to choose the white ones and the red one will give us the number of possible ways the Powerball lottery can draw its numbers.

Fig. 9: Combination of White Balls and Red Ball with Probability

As we can see, the result of 175,223,510 is the number of different combinations of Powerball ticket numbers. This matches the odds of winning the jackpot, as given on their website. The probability, frankly, does not look good. At all. It looks like we have a 0.000000571% chance of winning the jackpot with any random ticket (that is a lot of zeros).

However you have to play for a chance of winning (a ticket to dream, in my opinion). So best of luck when playing Powerball!

In the next post for Probability Distributions in R, we will cover calculations from different probability distributions.

Thanks for reading!

-Wayne

Pages