## Friday, August 30, 2013

### R1.2: Probability Distributions - Calculations for Statistical Distributions in R

In this post we will explore the different calculations we can apply to statistical distributions in R. We will cover density and point probability, cumulative probability, quantiles, and pseudo-random numbers.

### 1. Density and Point Probability

When we have a continuous distribution, such as a normal distribution (Gaussian), the density at a point x is a measure of the relative likelihood of getting a value close to x. For a particular interval in the distribution, the probability of getting a value in that interval is the area under the density curve over the interval.

For example, we can use the familiar 'bell-curve' otherwise known as a normal distribution. Through the console in R, we will create a plot showing the density of a normal distribution:

 Fig. 1: Creating a Density Plot for a Normal Distribution
For the x values, I used the seq function to generate values from -4 to 4 in increments of 0.1. The y values are obtained with the dnorm function, which returns the density of a normal distribution at the given x values. To make the plot visually pleasing, the plot type is changed to "l" for lines, which gives us:
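Since the screenshot in Figure 1 is not shown here, the code would look something like this (a minimal sketch using the values described above):

```r
# x values from -4 to 4 in steps of 0.1
x <- seq(-4, 4, 0.1)
# dnorm() gives the density of the standard normal distribution at each x
y <- dnorm(x)
# type = "l" connects the points with lines for a smooth curve
plot(x, y, type = "l")
```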

 Fig. 2: Density plot of a Standard Normal Distribution
Shifting to discrete distributions, such as binomial distributions, the variables can only take distinct values, such as the number of iPads sold. The number of iPads is distinct because you cannot sell half an iPad, only whole quantities: one, twelve, one hundred, and so on. This differs from continuous distributions, where the variables can take any value within the specified range, such as temperature or mass. A binomial distribution is a special case of a discrete distribution where each outcome is one of two possibilities: yes/no, success/failure. For this binomial distribution, we will set the number of independent trials at n = 50 and the probability of 'success' at 0.33.

 Fig. 3: R Code for Plotting a Binomial Distribution
As with plotting the normal distribution, the y values for the binomial plot are given by dbinom(x, size=50, prob=0.33). The size argument corresponds to the n of 50, and prob represents the probability of success at 0.33.
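The code behind Figures 3 and 4 would look roughly like this, assuming x runs over the possible success counts 0 through 50:

```r
# possible numbers of successes out of n = 50 trials
x <- 0:50
# point probability of each success count for Binomial(size = 50, prob = 0.33)
y <- dbinom(x, size = 50, prob = 0.33)
# type = "h" draws vertical lines, giving a histogram-like look
plot(x, y, type = "h")
```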

 Fig. 4: Plot of a Binomial Distribution, Histogram Type
Above in Figure 4, we see the plot of the binomial distribution with n = 50 and a probability at 0.33. We see the highest point probabilities centered around 16 and 17, where they have the highest chances of occurring. The chances of getting successes lower and higher than 16 or 17 (moving away from 16 or 17) drop as the point probabilities lower. This makes sense, as the probability was set at 0.33 with a size of 50 (0.33*50=16.5), and as this is a discrete distribution with whole success values so they must be highest around 16 and 17.

### 2. Cumulative Probability

Cumulative probability distributions express the probability of 'hitting' a specified x or less in a given distribution. I will not show plots of cumulative probability distributions because, most of the time, actual numbers are desired. Suppose we have a non-standard normal distribution of blood pressures with a mean of 132 and a standard deviation of 13, and we encounter a patient with a blood pressure of 160 (systolic, hopefully). Our x in this situation is 160, so what is the probability of a patient having a blood pressure of 160 or less?

 Fig. 5: Cumulative Probability of a Normal Distribution with u=132 and sd=13
The pnorm function gives us the cumulative probability in a normal distribution. From Figure 5, we see that 98.44% of patients have a blood pressure of 160 or lower. Subtracting from 1, we find that 1.56% of patients have a blood pressure of 160 or higher.
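The calculation from Figure 5 would look like this:

```r
# probability of a blood pressure of 160 or lower
pnorm(160, mean = 132, sd = 13)       # ≈ 0.9844
# probability of a blood pressure of 160 or higher
1 - pnorm(160, mean = 132, sd = 13)   # ≈ 0.0156
```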

For discrete distributions, cumulative probability works similarly, using the pbinom function. Suppose we have 20 people choosing between Coke (A) and Pepsi (B) in a blind test, and say the preference is equal (p=0.5). In the test, 16 people chose option A, Coke. How likely is it that 16 people would choose Coke over Pepsi?

 Fig. 6: Cumulative Probability of a Binomial Distribution with n=20 and prob=0.5
Using the pbinom function, with size being the number of people and prob the probability of the first option, we see a 99.87% probability that 16 or fewer people would choose Coke. What is the probability of 16 or more people? For that we use pbinom at 15 instead of 16, because we want 16 included when we subtract the probability from 1. So the chance of 16 or more people choosing Coke over Pepsi is 0.59%.
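The calculation from Figure 6 would look like this:

```r
# probability that 16 or fewer of 20 people choose Coke
pbinom(16, size = 20, prob = 0.5)       # ≈ 0.9987
# probability that 16 or more choose Coke: evaluate at 15 so 16 is included
1 - pbinom(15, size = 20, prob = 0.5)   # ≈ 0.0059
```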

But what if we did not know beforehand which drink was preferred? Then we need to add the probability of equally extreme results going the other way, i.e. the chance of 4 or fewer people choosing option A (Coke) over B (Pepsi).

 Fig. 7: Cumulative Two-Tailed Probability of a Binomial Distribution, n=20, prob=0.5
In addition to the probability of 16 or more from Figure 6, we also include the probability of 4 or fewer, as shown in Figure 7. The two-tailed probability of 16 out of 20 people choosing option A with a probability of 0.5 is 1.18%, which is twice the one-tailed probability.
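The two-tailed calculation from Figure 7 would look like this:

```r
# upper tail (16 or more choose A) plus lower tail (4 or fewer choose A)
1 - pbinom(15, size = 20, prob = 0.5) + pbinom(4, size = 20, prob = 0.5)   # ≈ 0.0118
```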

### 3. Quantiles

Quantiles are the inverse of cumulative distribution functions: the p-quantile is the value at which there is probability p of getting a value less than or equal to it. So the median, the value below which half of the distribution lies, is also known as the 50% quantile. In the back of statistics books, you will most likely find tables of fixed probabilities showing the boundary a test statistic must pass to be significant at that level. They typically cover the 90%, 95%, and 99% levels, with the 95% level the most often used.

 Fig. 8: Formula for the 95% Confidence Interval of a Normal Distribution
Here in Figure 8, we have the formula for a 95% confidence interval (CI). Given the key quantities of a normal distribution, x bar (sample mean), sigma (standard deviation), and n (sample size), we can derive the 95% CI for μ, the true mean. Suppose x bar = 83, sigma = 12, and n = 5 (small, but okay). The standard error of the mean (sem) is the factor that scales the quantile coefficients (Figure 10) in the formula (Figure 8).
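The formula in Figure 8 is the standard 95% CI for a normal distribution with known sigma:

```latex
\bar{x} \pm 1.96 \cdot \frac{\sigma}{\sqrt{n}}
```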

 Fig. 9: Code for Standard Error of the Mean and 95% CI
In the code above, we see that the sem (5.367) is calculated by dividing the standard deviation by the square root of the sample size. The sem is then multiplied by the 0.025 and 0.975 quantiles (so there is 2.5% in each tail) and added to xbar to get the confidence interval about the true mean. So with this sample's lower and upper bounds, we are 95% confident that the interval from 72.48 to 93.52 covers the true mean.
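The code in Figure 9 would look something like this, using the values given above:

```r
xbar  <- 83          # sample mean
sigma <- 12          # standard deviation
n     <- 5           # sample size
# standard error of the mean
sem <- sigma / sqrt(n)        # ≈ 5.367
# 95% confidence interval for the true mean
xbar + qnorm(0.025) * sem     # lower bound, ≈ 72.48
xbar + qnorm(0.975) * sem     # upper bound, ≈ 93.52
```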

 Fig. 10: qnorm Functions and Quantile Values for Standard Normal Distribution
On a side note, the qnorm(k) function gives us the k-quantile in a standard normal distribution, the value below which a proportion k of the distribution lies. The 2.5% and 97.5% quantiles appear often in statistics and are usually abbreviated as -1.96 and 1.96, respectively. Between the 2.5% and 97.5% quantiles lies 95% of the values, hence the 95% confidence interval.
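The quantile values from Figure 10 come straight from qnorm:

```r
qnorm(0.025)   # ≈ -1.96, the 2.5% quantile
qnorm(0.975)   # ≈  1.96, the 97.5% quantile
```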

Here is some extra code for some fun:

 Fig. 11: Z Values, and 95% CI Calculation
What we have here in Figure 11 is the deconstruction of the 95% CI bounds back into Z scores, the standard 95% quantiles at -1.96 and 1.96. As you can see, the difference between each bound and xbar is multiplied by the square root of n divided by the standard deviation (the reciprocal of sigma/sqrt(n)). Next, we concatenate the two quantiles into a vector and multiply it by the standard error of the mean, producing the lower and upper bounds together.
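A sketch of the Figure 11 code, assuming the same xbar, sigma, and n as before:

```r
xbar <- 83
sem  <- 12 / sqrt(5)
# converting the CI bounds back into Z scores
(72.48 - xbar) * sqrt(5) / 12         # ≈ -1.96
(93.52 - xbar) * sqrt(5) / 12         # ≈  1.96
# both bounds at once, using a vector of the two quantiles
xbar + sem * qnorm(c(0.025, 0.975))   # ≈ 72.48  93.52
```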

### 4. Pseudo-random Numbers

Random numbers drawn in R are not truly random; they are determined by a seed number, which is why they are called pseudo-random. For practical purposes, however, these generated numbers behave as random, so they can be used to simulate sets of 'randomly' generated numbers.

 Fig. 12: Randomly Generated Sample Distributions
In the sample distributions shown in Figure 12, the first two are sets of 10 randomly generated numbers from a normal distribution. The numbers differ because the generator's state advances with each call unless the seed is reset, so successive calls will (almost certainly) not produce the same values. Even when we repeatedly generate from the same normal distribution, we get different values, which is exactly what we want in practice.
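A minimal sketch of the behavior described above (the seed value 42 here is an arbitrary choice for illustration):

```r
rnorm(10)         # ten draws from a standard normal distribution
rnorm(10)         # ten different draws: the generator's state has advanced
set.seed(42)      # fix the seed...
a <- rnorm(10)
set.seed(42)      # ...reset it to the same value...
b <- rnorm(10)
identical(a, b)   # TRUE: resetting the seed reproduces the draws exactly
```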

Second, we can also adjust the arguments of the random generation functions, such as setting the mean of a normal distribution to 7 and its standard deviation to 5, or drawing 10 values from a binomial distribution with a size of 20 and a probability of success of 0.5.
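Those adjusted calls would look like this:

```r
rnorm(10, mean = 7, sd = 5)         # ten draws from a normal with mean 7, sd 5
rbinom(10, size = 20, prob = 0.5)   # ten success counts from Binomial(20, 0.5)
```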

There are many more types of distributions and statistical tests available in R, which build on the fundamental calculations on distributions in this post. We will explore these in later posts.