# Hypothesis Testing, Confidence Intervals, Confidence Levels, Alpha level, Beta level and Power … all clearly explained

I recently found the best-worded explanation of Hypothesis Testing, Confidence Intervals, Confidence Levels, Alpha level, Beta level and Power that I have ever read. All credit to YALE UNIVERSITY (USA) Department of Statistics!
Students of Lean Six Sigma usually follow PowerPoint presentations and have a trainer to assist them through the Hypothesis Testing parts of Green Belt and Black Belt. However, if you take the time to read through the following 6 pages, they give you all of the core understanding you need prior to using Minitab to analyze your real-world problems. I took the liberty of editing the original version from Yale to take out sentences that refer to previous sections of their course work, and also items that are useful to pure mathematics but just confuse the subject for a “belt” using it in the real world. I thoroughly recommend that any potential Green or Black Belt have a go at reading the following. I also added the explanation of when to use the z-score rather than the t-score, which was not explained in the original published by Yale University.
Once again, I must say that all credit goes to the professor at Yale who took the time to explain this subject so elegantly.
JD

Hypothesis Testing, Confidence Intervals, Confidence Levels and Power:

We use statistics to make inferences about a population. We use statistics to be able to support our claims, and to refute claims that we want to challenge.

Probability allows us to take chance variation into account and so we can substantiate our conclusions by doing probability calculations. The test statistics that we use (such as z, and later t, and F) have unique sampling distributions which allow us to compare the probability of the occurrence of our sample data with what is believed to be true about the population.

The two most common types of statistical inference are significance testing and confidence intervals. These two methods of making inferences allow us to make claims or dispute claims based on collected data. These two types of statistical inferences have different goals:
Significance testing: used to assess the evidence provided by our data so that we can make some claim about a population.
Confidence intervals: used to estimate a population parameter.
Assumptions we will hold for these procedures:

1. Data come from a random sample, a SRS (simple random sample)
2. The variable is normally distributed for the population

# Significance Testing

We use the Z-statistic (Z-score) when we want to test the significance of a difference between the average of a SAMPLE and the average of a POPULATION and we KNOW the standard deviation of the population. With a Z-test we use a Normal distribution curve; we can have confidence in that distribution because we know the standard deviation of the population.

We use the t-statistic (t-score) when we DO NOT KNOW the standard deviation of the population and so need to use the standard deviation of the sample. The 1-sample t-test is very similar to the Z-test: we want to test the significance of a difference between the average of a SAMPLE and the average of a POPULATION. The difference is that the t-test uses the standard deviation of the sample, not of the population.
The 2-sample t-test is used when we want to test the significance of a difference between the average of one SAMPLE and the average of another SAMPLE, that is, whether both could have come from populations with the same mean.
With t-tests (unlike Z-tests) we are using a “family” of t-curves that change their shape depending on the sample size. The larger the sample size, the more closely the curve approximates a Normal distribution curve (as used by Z-tests).
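For reference, the two statistics described above can be written side by side. The notation here is an assumption added for illustration and does not come from the Yale text: $\bar{x}$ is the sample mean, $\mu_0$ the population mean stated in the null hypothesis, $\sigma$ the known population standard deviation, $s$ the sample standard deviation and $n$ the sample size.

```latex
z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}
\qquad \text{versus} \qquad
t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}
\quad \text{(1-sample t-test, with } n - 1 \text{ degrees of freedom)}
```

The only change is which standard deviation sits in the denominator. The degrees of freedom are what select the member of the t-curve “family” to use.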

For the following example we are considering the SAMPLE compared to the POPULATION and we are told the standard deviation of the POPULATION ( 15) so we are concerned with using the Z-score and Z-test.

Example 1:
We are interested in investigating whether the IQs of Yale students are higher than those of the general American population. We take a random sample of 35 Yale students, measure their IQs, and find that the sample mean is 107. We know that IQ is normally distributed in the general population with mean (mu) = 100 and sigma = 15.
Procedure for carrying out a significance test:

1. State the null and alternative hypotheses

Null hypothesis ($H_0$, “H-naught”): the statistical hypothesis that we test. We are assessing the strength of the evidence of our data against the null hypothesis.

• this is always a statement about a population and so it will be expressed in terms of a parameter

$H_0: \mu = 100$

Alternative hypothesis ($H_a$ or $H_1$): a statement that specifies what we think is true. An expression that is incompatible with the null hypothesis.
$H_a: \mu > 100$
In this example we have a one-sided alternative hypothesis. Usually it is appropriate to have a one-sided hypothesis when we have some strong a priori reason for making a one-sided claim. (We know that Yale students have above-average SAT scores, so their IQs are probably higher as well.)
If we do not have a good a priori reason for making a one-sided hypothesis, it is best to go with a two-sided hypothesis.
$H_a: \mu \neq 100$ (there is a difference but we are not sure in which direction)
(Note: with a one-sided hypothesis we do a one-tailed test, which generally has more power (ability to detect the desired effect) than a two-tailed test. More on this later.)

2. Calculate the test statistic – the Z Statistic

We know its distribution (see table A). It allows us to measure the compatibility between our data and the null hypothesis.
$$Z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$$
(In this example we know the population standard deviation, but in cases where it is unknown, for large samples (n > 25 or 30) we can use the sample standard deviation as an estimate.)
$$Z = \frac{107 - 100}{15 / \sqrt{35}} = 2.76$$

3. Find the P-value or Z-crit (the critical Z-score)

Look up the probability of getting a Z-score of 2.76 in Table A. The table gives .9971, the proportion of the distribution that lies below Z = 2.76. Here we are doing a 1-tailed test in the > direction, so we know which half of the distribution we are looking at.
The P-value is 1 - .9971 = .0029. The Z-crit for alpha .05 is 1.64.
.0029 signifies that in 29 times out of 10,000 we would find a sample mean of 107 or higher if mu were really 100. In other words, our observed outcome from our data is highly unlikely under the supposition that the null hypothesis is true. This is far from what we would expect if the null hypothesis were true.
So we can conclude then that our observed data are more probable in terms of the alternative hypothesis being true.
(Important Note: Because we are always testing the null hypothesis, it is never correct to state that we have proved the alternative hypothesis. We have only rejected the null hypothesis.)
The smaller the P-value the stronger the evidence against the null hypothesis.
So how much evidence against the null hypothesis do we need before we can reject it?
Our evidence relates to a significance level, alpha ($\alpha$), that has been predetermined before we began analyzing our data. This is customarily .05, but it depends on circumstances and the degree of certainty which we desire.
The alpha level is the probability of making a Type I error: rejecting the null hypothesis when it is actually true. If our alpha level is .05, then in 5 out of 100 tests where the null hypothesis is true we will reject it when we shouldn’t have.
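A quick simulation can make the meaning of alpha concrete. The sketch below is an illustration only, not part of the Yale text: it assumes the null hypothesis of the IQ example is true (mu = 100, sigma = 15, samples of 35), repeats the one-tailed Z-test on many simulated samples, and counts how often the test wrongly rejects. The number of trials and the random seed are arbitrary choices.

```python
import random
from statistics import NormalDist

random.seed(0)  # fixed seed so the sketch is repeatable

MU_0, SIGMA, N = 100, 15, 35   # null mean, known sigma, sample size (IQ example)
ALPHA = 0.05
Z_CRIT = NormalDist().inv_cdf(1 - ALPHA)   # one-tailed critical value, about 1.645


def rejects_null(sample):
    """One-tailed Z-test of H0: mu = MU_0 against Ha: mu > MU_0."""
    sample_mean = sum(sample) / len(sample)
    z = (sample_mean - MU_0) / (SIGMA / N ** 0.5)
    return z > Z_CRIT


TRIALS = 20_000
false_rejections = 0
for _ in range(TRIALS):
    # Draw a sample from a population where the null hypothesis really is true.
    sample = [random.gauss(MU_0, SIGMA) for _ in range(N)]
    if rejects_null(sample):
        false_rejections += 1

type_i_rate = false_rejections / TRIALS
print(f"Observed Type I error rate: {type_i_rate:.3f}")  # should land near ALPHA
```

Because every simulated sample comes from a population with mu = 100, each rejection is a Type I error, and the observed rate should sit close to the chosen alpha.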
Back to our example then: we compare our calculated P-value .0029 and see that it is less than our predetermined alpha level of .05, so we can reject the null hypothesis and conclude that our data is significant at the .05 level.
Also our calculated Z-score (2.76) is > the Z-crit (1.64) which confirms as well that we should reject the null hypothesis.
4. State a conclusion
What does “significant” mean (Moore & McCabe, p.459)? It means signifying something, namely that Yalies’ IQs are significantly greater than those of the general population at the .05 level. We have rejected the null hypothesis.
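The whole of Example 1 can be reproduced in a few lines. This is a minimal sketch using Python's standard-library `statistics.NormalDist` in place of Table A; the variable names are assumptions chosen for illustration.

```python
from math import sqrt
from statistics import NormalDist

# Values from Example 1
x_bar, mu_0, sigma, n = 107, 100, 15, 35
alpha = 0.05

std_normal = NormalDist()                  # standard Normal curve (mean 0, sd 1)
std_error = sigma / sqrt(n)                # standard deviation of the sample mean
z = (x_bar - mu_0) / std_error             # test statistic
p_value = 1 - std_normal.cdf(z)            # upper tail, because Ha is mu > 100
z_crit = std_normal.inv_cdf(1 - alpha)     # one-tailed critical Z-score

print(f"Z = {z:.2f}")
print(f"P-value = {p_value:.4f}")
print(f"Z-crit = {z_crit:.3f}")
print("reject H0" if p_value < alpha else "fail to reject H0")
```

Comparing the P-value with alpha and comparing Z with Z-crit are two views of the same decision, so they always agree.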
Example 2: Two-sided hypothesis/ two-tailed test
An educational researcher is interested in investigating whether third-graders in the Hartford school district perform differently on the Degree of Reading Power test than the national average for third graders, which is 32. Degree of Reading Power (DRP) scores are recorded for a random sample of 44 third-grade students and the mean score is 35.1. DRP scores are approximately normal and the standard deviation for this school district is known to be 11. The researcher will work with an alpha level of .05.
$H_0: \mu = 32$
$H_a: \mu \neq 32$
$$Z = \frac{35.1 - 32}{11 / \sqrt{44}} \approx 1.87$$
Because this is a two-sided test, we must find the tail probability beyond our Z-score and multiply it by 2. So our P-value is 2(1 - .9693) = 2 * .0307 = .0614
.0614 > .05, so we fail to reject the null hypothesis.
Here the Z-crit scores are + or – 1.96
If the researcher had reason to believe that the students’ DRP scores were higher than the national mean, then we would do a one-tailed test.
The Z-crit would be 1.64.
Our calculated Z was about 1.87, which is > 1.64, and our P-value no longer needs to be multiplied by 2, so it is .0307, which is < .05. In this case then we can reject the null hypothesis and conclude that children in this district have a mean score that is higher than the national mean.
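The contrast between the two-tailed and one-tailed versions of Example 2 can be checked with a short sketch. It uses `statistics.NormalDist` instead of Table A and does not round Z before the lookup, so the P-values may differ slightly in the later decimal places from the table-based figures. The variable names are assumptions for illustration.

```python
from math import sqrt
from statistics import NormalDist

# Values from Example 2
x_bar, mu_0, sigma, n = 35.1, 32, 11, 44
alpha = 0.05

std_normal = NormalDist()
z = (x_bar - mu_0) / (sigma / sqrt(n))
upper_tail = 1 - std_normal.cdf(abs(z))    # area beyond |Z| in one tail

p_two_sided = 2 * upper_tail               # Ha: mu differs from 32
p_one_sided = upper_tail                   # Ha: mu is greater than 32

z_crit_two_sided = std_normal.inv_cdf(1 - alpha / 2)   # about 1.96
z_crit_one_sided = std_normal.inv_cdf(1 - alpha)       # about 1.645

print(f"Z = {z:.3f}")
print(f"two-sided: P = {p_two_sided:.4f}, reject H0: {p_two_sided < alpha}")
print(f"one-sided: P = {p_one_sided:.4f}, reject H0: {p_one_sided < alpha}")
```

The same data lead to different decisions purely because of how the alternative hypothesis was stated, which is why the direction has to be chosen before looking at the data.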
Let’s switch over now to talk about confidence intervals, keeping this last example in mind.

# Confidence Intervals

We use confidence intervals to make estimates about population parameters. So far in the examples we have looked at there has been a known mean for the population. If the mean for a population is unknown we can make an educated guess with a certain amount of confidence based on our sample data.
When thinking about confidence intervals it is important to keep in mind the 68-95-99.7 rule. On the most basic level this rule signifies that 68% of our data will fall within 1 standard deviation of the mean, 95% will fall within 2 standard deviations of the mean and 99.7% will fall within 3 standard deviations of the mean, for a normally distributed variable.
Think of the IQ example. IQ has a population mean of 100 and a standard deviation of 15. So the range of IQs from 85 to 115 accounts for 68% of all IQ scores.
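The rule can be verified directly for the IQ distribution. This is a small sketch using `statistics.NormalDist`; the dictionary name `coverage` is just an illustrative choice.

```python
from statistics import NormalDist

iq = NormalDist(mu=100, sigma=15)   # IQ distribution from the example

coverage = {}
for k in (1, 2, 3):
    low, high = 100 - k * 15, 100 + k * 15
    # Share of the population with an IQ between low and high
    coverage[k] = iq.cdf(high) - iq.cdf(low)
    print(f"within {k} sd ({low} to {high}): {coverage[k]:.1%}")
```

The exact shares are a little different from the rounded 68, 95 and 99.7 figures, which is why the Z-value for a 95% interval is 1.96 rather than exactly 2.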
This rule is easily applied to confidence intervals. If we don’t know a population mean but we have a sample mean, we can use a confidence interval to say something about where the true population mean may fall. The rule states that for a normally distributed variable, there is a probability of .95 that our sample mean will be within 2 standard deviations of the population mean, where the standard deviation in question is that of the sample mean (sigma/square root of n).
So the form is: estimate + or – a margin of error.
The margin of error designates how accurate we believe our guess to be based on the variability of the estimate.
When using the Z-statistic the margin of error = the Z-value * sigma/ square root of n
Let’s use our example of third-graders’ DRP scores to see how this works:
Imagine that we don’t know the national mean for DRP scores, but we can predict with a certain amount of confidence what it might be based on our sample data.
Remember that the sample mean = 35.1, sigma = 11 and n = 44.
Let’s calculate a 95% confidence interval for the mean reading score.
For a 95% confidence interval my Z-values will be + or – 1.96
CI: mean + or – Z * sigma/square root of n
CI: 35.1 – 1.96 * 11/6.63, 35.1 + 1.96 * 11/6.63
CI: 35.1 – 3.25, 35.1 + 3.25
CI: (31.85, 38.35) We conclude that we are 95% confident that the true mean of the population falls between 31.85 and 38.35. (As we know from the information we were given by the national database, the population mean is actually 32.)
As Moore and McCabe (p.445), as well as Pollard, try to explain, there is a hair-splitting distinction to be made when trying to understand what confidence intervals mean.
-When we say that we are 95% confident that the mean DRP score lies between 31.85 and 38.35, this does not signify that it is 95% probable that the true mean lies within this range. Rather, it signifies that the method produces an interval containing the true mean in 95% of all possible samples.
Still confused? Try thinking about it this way:
(Sticking with the reading scores example) Say the researcher took another sample of 44 children from the district and found their mean DRP score to be 18. If we constructed a 95% confidence interval for the population mean based on this sample data it would come out to be (14.75, 21.25). We would say that the true population mean lies within this interval, but really we know that the true mean is 32. So this would be one of the 5% of the samples for which the confidence interval estimate would not contain the true mean.
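Both the interval calculation and the “95% of all possible samples” interpretation can be sketched in code. The first part applies the margin-of-error formula Z * sigma/square root of n to the DRP sample. The second part is a simulation added for illustration, not part of the Yale text: it assumes the true mean really is 32, draws many samples of 44, and counts how many of the resulting intervals contain 32. The function name `z_interval`, the trial count and the seed are assumptions.

```python
import random
from math import sqrt
from statistics import NormalDist


def z_interval(x_bar, sigma, n, confidence=0.95):
    """Confidence interval for a mean when the population sigma is known."""
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 for 95%
    margin = z_star * sigma / sqrt(n)
    return x_bar - margin, x_bar + margin


# Part 1: the interval for the observed DRP sample
low, high = z_interval(35.1, sigma=11, n=44)
print(f"95% CI: ({low:.2f}, {high:.2f})")

# Part 2: how often does the method capture the true mean?
random.seed(1)  # fixed seed so the sketch is repeatable
TRUE_MEAN, SIGMA, N, TRIALS = 32, 11, 44, 10_000
hits = 0
for _ in range(TRIALS):
    sample = [random.gauss(TRUE_MEAN, SIGMA) for _ in range(N)]
    sample_mean = sum(sample) / N
    lo, hi = z_interval(sample_mean, SIGMA, N)
    if lo <= TRUE_MEAN <= hi:
        hits += 1

coverage_rate = hits / TRIALS
print(f"Intervals containing the true mean: {coverage_rate:.1%}")
```

Any single interval either contains the true mean or it does not. The 95% describes the long-run success rate of the method, which is what the simulation counts.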
Some things to keep in mind about confidence intervals:

1. Most of the time when we do confidence intervals we are working with estimating means. Remember that means are not resistant, so outliers can affect confidence intervals. (Try to justify removing or correcting for outliers before computing the sample mean that the confidence interval will be based on.)
2. Data must come from a simple random sample in order to construct a confidence interval.
3. The interval relies on the distribution of the sample mean, which hopefully you can assume to be normal. When n is > or = 15, though, the confidence level is not greatly disturbed by non-normal populations unless extreme skewness or outliers are present.
4. If you do not know the standard deviation of a population and your sample size is large, then you can use the sample standard deviation as an estimate for the population standard deviation in your margin of error.
5. Caution: the margin of error in a confidence interval includes only random sampling error (amount of error that can be expected because of chance variation).

Properties of confidence intervals :
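-as the C-level (confidence level) decreases, the margin of error decreases
-as the sample size (n) increases, the margin of (chance) error decreases for a given confidence level or, for a fixed margin of error, the higher the confidence level you can claim
-as the population standard deviation decreases, the margin of error decreases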
You can also determine how large your sample size should be to construct a confidence interval for a specified margin of error for a normal mean:
n = (Z * sigma / m) squared, where m is the desired margin of error
Example 3: How much corn do I need?
Crop researchers are interested in estimating the average amount of bushels of corn that a new variety of corn they are planting will yield. Cost is an important factor, so they want to know how many plots of corn they need to plant to be able to estimate the mean yield of bushels of corn within 4 bushels per acre with 90% confidence. Assume that sigma is 10.
n= (1.645* 10/4)squared
n= 16.91
So they need 17 plots of corn to estimate the mean yield within 4 bushels of corn per acre with 90% confidence.
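The corn calculation is easy to wrap in a small helper. This is a sketch under the same assumptions as the formula above (known sigma, Normal sampling distribution); the function name `sample_size_for_margin` is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def sample_size_for_margin(margin, sigma, confidence):
    """Unrounded sample size needed for a given margin of error (sigma known)."""
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.645 for 90%
    return (z_star * sigma / margin) ** 2


raw_n = sample_size_for_margin(margin=4, sigma=10, confidence=0.90)
plots_needed = ceil(raw_n)   # always round up: 16.9 plots cannot be planted

print(f"n = {raw_n:.2f}, so plant {plots_needed} plots")
```

Rounding up rather than to the nearest whole number guarantees that the margin of error is no larger than the one requested.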
Confidence intervals are useful because they are concerned with both the level of confidence and the margin of error. The significance level like the confidence level says how reliable a method is in repeated use.
But having high confidence (say 99%) is not very valuable if the interval is so wide that it includes most values of the parameter. Similarly, a test with a small alpha level (say .01) is not very useful if it almost never rejects the null hypothesis. What we need to be concerned with then is power.
# Power
Power is the probability, for a fixed alpha level, that the significance test will reject the null hypothesis when a particular alternative value of the parameter is true.
Power is directly related to Type II error, which, if you recall, is failing to reject the null hypothesis when we should have rejected it. The probability of a Type II error is called beta.
Mathematically, power = 1 - probability of Type II error = 1 - beta
High power is what we want. The standard for power is usually .80, or 80% power. (Note: so desirable levels of Type II error will be no more than .20, or 20%.)
How do we calculate power?
Example 4:
An SRS of 500 Connecticut high school students’ SAT scores is taken. A teacher believes that the mean will be no more than 450, because that is the mean score for the North Eastern US. If the population standard deviation is 100 and the test is carried out at the 1% level of significance, determine whether this test is sufficiently sensitive (has enough power) to be able to detect an increase of 10 points in the population mean SAT score.
Steps for calculating power:

1. State the null hypothesis, the alternative hypothesis, the particular alternative that we want to detect, and the alpha level.

$H_0: \mu = 450$
$H_a: \mu > 450$
The alternative of interest is $\mu = 460$, at the 1% level of significance.

2. Find the values of x that will lead us to reject $H_0$

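-use the Z-statistic
$$Z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} = \frac{\bar{x} - 450}{100 / \sqrt{500}} = \frac{\bar{x} - 450}{4.47}$$
Substitute the Z-score based on the appropriate alpha level. For a one-tailed test at alpha = .01 the critical Z-score is 2.326, so we reject $H_0$ when
$$\bar{x} > 450 + 2.326 \times 4.47 = 460.4$$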

3. Calculate the probability of observing these values of x when the alternative is true.

$$P(\bar{x} > 460.4 \text{ when } \mu = 460) = P\left(Z > \frac{460.4 - 460}{4.47}\right) = P(Z > .0894) = 1 - .5359 = .4641$$
Here we have a power of about 46%. This test is not very sensitive to a 10-point increase in the mean score. (Really this isn’t surprising since the standard deviation is 100.)
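The three steps for calculating power translate directly into code. This is a minimal sketch using `statistics.NormalDist` in place of the table; it keeps unrounded intermediate values, so the result may differ slightly from a table lookup. The variable names are assumptions for illustration.

```python
from math import sqrt
from statistics import NormalDist

# Step 1: hypotheses, the alternative of interest, and alpha (Example 4)
mu_0, mu_alt, sigma, n, alpha = 450, 460, 100, 500, 0.01

std_normal = NormalDist()
std_error = sigma / sqrt(n)                 # about 4.47

# Step 2: sample means above x_crit lead us to reject H0
z_crit = std_normal.inv_cdf(1 - alpha)      # about 2.326
x_crit = mu_0 + z_crit * std_error

# Step 3: probability of landing above x_crit when mu is really mu_alt
power = 1 - std_normal.cdf((x_crit - mu_alt) / std_error)

print(f"reject H0 when the sample mean exceeds {x_crit:.1f}")
print(f"power against mu = {mu_alt}: {power:.3f}")
print(f"Type II error (beta): {1 - power:.3f}")
```

A power below one half means this test would miss a genuine 10-point increase more often than it would detect it.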
So we have a power that we are not happy with, how do we increase power?
There are several ways to increase power:

1. Increase the sample size.

(Note: Now that you have the formula for calculating power, you can actually decide a priori how much power you want and plug it into the equation to figure out the sample size that you will need to reach that level of power. This is very economical because adding subjects can add a lot of expense to research.)

2. Decrease the variation, which essentially has the same effect as increasing the sample size. You have a better chance of distinguishing the alternative value of mu from the hypothesized value. Be very cautious in your measurement process, and you may want to limit your population so that it is more homogeneous.
3. Increase the alpha level. You will have more of a chance of rejecting the null hypothesis with a 5% test than with a 1% test, because the strength of evidence required is less.
4. Consider an alternative that is farther away from the null hypothesis. Values of the population mean that are in the alternative but are close to the hypothesized value are harder to detect (lower power) than values of the population mean that are farther from the hypothesized population mean.
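Each of the four levers above can be tried out on the SAT example. The sketch below wraps the power calculation in a hypothetical function, `power_one_tailed`, and changes one input at a time; the alternative settings (n = 1000, sigma = 50, alpha = .05, mu = 470) are arbitrary values chosen for illustration. The second function, `n_for_power`, uses a standard closed-form rearrangement of the power formula for a one-tailed Z-test. That rearrangement is not given in the Yale text, so treat it as an assumption.

```python
from math import ceil, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def power_one_tailed(mu_0, mu_alt, sigma, n, alpha):
    """Power of an upper-tailed Z-test of H0: mu = mu_0 when mu is really mu_alt."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha)
    shift = (mu_alt - mu_0) / (sigma / sqrt(n))   # effect size in standard errors
    return 1 - STD_NORMAL.cdf(z_crit - shift)


def n_for_power(mu_0, mu_alt, sigma, alpha, target_power):
    """Smallest n that reaches target_power (one-tailed Z-test, sigma known)."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha)
    z_beta = STD_NORMAL.inv_cdf(target_power)
    return ceil(((z_alpha + z_beta) * sigma / (mu_alt - mu_0)) ** 2)


baseline = power_one_tailed(450, 460, sigma=100, n=500, alpha=0.01)
bigger_sample = power_one_tailed(450, 460, sigma=100, n=1000, alpha=0.01)
less_variation = power_one_tailed(450, 460, sigma=50, n=500, alpha=0.01)
higher_alpha = power_one_tailed(450, 460, sigma=100, n=500, alpha=0.05)
farther_alt = power_one_tailed(450, 470, sigma=100, n=500, alpha=0.01)

for label, value in [
    ("baseline (Example 4)", baseline),
    ("1. larger sample (n = 1000)", bigger_sample),
    ("2. less variation (sigma = 50)", less_variation),
    ("3. higher alpha (.05)", higher_alpha),
    ("4. farther alternative (mu = 470)", farther_alt),
]:
    print(f"{label}: power = {value:.3f}")

required_n = n_for_power(450, 460, sigma=100, alpha=0.01, target_power=0.80)
print(f"sample size for 80% power: {required_n}")
```

Every lever raises power relative to the baseline, but only the sample size and the measurement process are usually under the researcher's control once alpha and the effect of interest are fixed.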

One last thing: The Relationship Between Confidence Intervals and Significance Testing
Suppose that for college students in the Ivy League I want to know how many hours of TV per day (on average) a person watches. I have no idea what the population mean might be or the standard deviation and I want to construct a 99% confidence interval for the population mean.
I take a sample of 105 Yale Students and find a mean of 3.2 hours per day and a standard deviation of .8 hours per day.
The 99% CI is 3.2 + or – .201 so (3.0, 3.4)
Say your roommate comes along and agrees with your claim, but tells you that one of her professors did a study and found that the mean was 3.5. She definitely thinks that the mean hours watched are different than this value.
Here $H_0: \mu = 3.5$
$H_a: \mu \neq 3.5$
Because the hypothesized value falls outside the confidence interval we just computed, we can reject $H_0$ at the 1% significance level (alpha = .01).
Your other friend comes along and claims that she did a similar study and found that her sample mean was 3.1.
Here $H_0: \mu = 3.1$
We cannot reject the null hypothesis here (for alpha level =.01) because 3.1 lies within the 99% confidence interval.
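The link between the two methods can be checked numerically. The sketch below builds the 99% interval for the TV example and then runs a two-sided Z-test at alpha = .01 for each claimed mean. It follows the earlier advice that, with a large sample, the sample standard deviation can stand in for sigma; a t-based calculation would give slightly different numbers. The names used are assumptions for illustration.

```python
from math import sqrt
from statistics import NormalDist

# TV-watching example: sample of 105, mean 3.2 hours, sample sd 0.8 hours
x_bar, s, n = 3.2, 0.8, 105
alpha = 0.01

std_normal = NormalDist()
std_error = s / sqrt(n)

# 99% confidence interval (large sample, so s is used as an estimate of sigma)
z_star = std_normal.inv_cdf(1 - alpha / 2)   # about 2.576
ci_low = x_bar - z_star * std_error
ci_high = x_bar + z_star * std_error
print(f"99% CI: ({ci_low:.3f}, {ci_high:.3f})")


def two_sided_p(mu_0):
    """P-value for H0: mu = mu_0 against the two-sided alternative."""
    z = (x_bar - mu_0) / std_error
    return 2 * (1 - std_normal.cdf(abs(z)))


for mu_0 in (3.5, 3.1):
    inside = ci_low <= mu_0 <= ci_high
    rejected = two_sided_p(mu_0) < alpha
    print(f"mu_0 = {mu_0}: inside CI = {inside}, reject H0 = {rejected}")
```

A hypothesized mean is rejected by the two-sided test at level alpha exactly when it falls outside the matching (1 - alpha) confidence interval, so the two columns of output are always opposites.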