8.1: Hypothesis Testing and the P-Value
Learning Objectives
- Develop null and alternative hypotheses to test for a given situation.
- Understand the critical regions of a graph for one- and two-tailed hypothesis tests.
- Calculate a test statistic to evaluate a hypothesis.
- Test the probability of an event using the p-value.
- Understand Type I and Type II errors.
- Calculate the power of a test.
Introduction
In this chapter we will explore hypothesis testing, which involves making conjectures about a population based on a sample drawn from the population. Hypothesis tests are often used in statistics to analyze the likelihood that a population has certain characteristics. For example, we can use hypothesis testing to analyze if a senior class has a particular average SAT score or if a prescription drug has a certain proportion of the active ingredient.
A hypothesis is simply a conjecture about a characteristic or set of facts. When performing statistical analyses, our hypotheses provide the general framework of what we are testing and how to perform the test.
These tests are never certain and we can never prove or disprove hypotheses with statistics, but the outcomes of these tests provide information that either helps support or refute the hypothesis itself.
In this section we will learn about different hypothesis tests, how to develop hypotheses, how to calculate statistics that help support or refute the hypotheses, and how to recognize the errors associated with hypothesis testing.
Developing Null and Alternative Hypotheses
Hypothesis testing involves testing the difference between a hypothesized value of a population parameter and the estimate of that parameter calculated from a sample. If the parameter of interest is the population mean, we are essentially determining the magnitude of the difference between the mean of the sample and the hypothesized mean of the population. If the difference is very large, we reject our hypothesis about the population. If the difference is very small, we do not. Below is an overview of this process.
In statistics, the hypothesis to be tested is called the null hypothesis and given the symbol $H_0$. The alternative hypothesis is given the symbol $H_a$.
The null hypothesis defines a specific value of the population parameter that is of interest. Therefore, the null hypothesis always includes the possibility of equality. Consider
$$H_0: \mu = 3.2 \qquad H_a: \mu \neq 3.2$$
In this situation, if our sample mean, $\bar{x}$, is very different from 3.2 we would reject $H_0$. That is, we would reject $H_0$ if $\bar{x}$ is much larger than 3.2 or much smaller than 3.2. This is called a two-tailed test. An $\bar{x}$ that is very unlikely if $H_0$ is true is considered to be good evidence that the claim $H_0$ is not true. Now consider $H_0: \mu \le 3.2$ and $H_a: \mu > 3.2$. In this situation we would reject $H_0$ only for very large values of $\bar{x}$. This is called a one-tailed test. If, for this test, our data gives a value of $\bar{x}$ far above 3.2, it would be highly unlikely that an $\bar{x}$ this different from 3.2 would occur by chance, and so we would probably reject the null hypothesis in favor of the alternative hypothesis.
Example: If we were to test the hypothesis that the seniors had a mean SAT score of 1100, our null hypothesis would be that the mean SAT score is equal to 1100, or:
$$H_0: \mu = 1100$$
We test the null hypothesis against an alternative hypothesis, which is given the symbol $H_a$ and includes the outcomes not covered by the null hypothesis. Basically, the alternative hypothesis states that the population mean differs from the hypothesized value. The alternative hypothesis can be supported only by rejecting the null hypothesis. In our example above about the SAT scores of graduating seniors, our alternative hypothesis would state that the mean SAT score is different from 1100, or:
$$H_a: \mu \neq 1100$$
Let’s take a look at examples and develop a few null and alternative hypotheses.
Example: We have a medicine that is being manufactured and each pill is supposed to have 14 milligrams of the active ingredient. What are our null and alternative hypotheses?
Solution:
$$H_0: \mu = 14 \qquad H_a: \mu \neq 14$$
Our null hypothesis states that the population has a mean equal to 14 milligrams. Our alternative hypothesis states that the population has a mean that is different from 14 milligrams. This is a two-tailed test.
Example: The school principal wants to test if it is true what teachers say -- that high school juniors use the computer an average 3.2 hours a day. What are our null and alternative hypotheses?
Solution:
$$H_0: \mu = 3.2 \qquad H_a: \mu \neq 3.2$$
Our null hypothesis states that the population has a mean equal to 3.2 hours. Our alternative hypothesis states that the population has a mean that differs from 3.2 hours. This is a two-tailed test.
Deciding Whether to Reject the Null Hypothesis: One-Tailed and Two-Tailed Hypothesis Tests
When a hypothesis is tested, a statistician must decide on how much evidence is necessary in order to reject the null hypothesis. For example, if the null hypothesis is that the average height of a population is 64 inches a statistician wouldn't measure one person who is 66 inches and reject the hypothesis based on that one trial. It is too likely that the discrepancy was merely due to chance.
We use statistical tests to determine if the sample data give good evidence against the claim $H_0$. The numerical measure of how strong the sample evidence must be before we are willing to reject $H_0$ is called the level of significance, and it is denoted by $\alpha$. If we choose, for example, $\alpha = 0.01$, we are saying that we will reject $H_0$ only if data at least as unusual as the data we have collected would occur no more than 1% of the time when $H_0$ is true.
The most frequently used levels of significance are 0.05 and 0.01. If our data results in a statistic that falls within the region determined by the level of significance, then we reject $H_0$. The region is therefore called the critical region. When choosing the level of significance, we need to consider the consequences of rejecting or failing to reject the null hypothesis. If there is the potential for health consequences (as in the case of active ingredients in prescription medications) or great cost (as in the case of manufacturing machine parts), we should use a more ‘conservative’ critical region with levels of significance such as 0.005 or 0.001.
When determining the critical regions for a two-tailed hypothesis test, the level of significance represents the extreme areas under the normal density curve. We call this a two-tailed hypothesis test because the critical region is located in both ends of the distribution. For example, if there was a significance level of 0.05, the critical region would be the most extreme 5 percent under the curve, with 2.5 percent in each tail of the distribution.
Therefore, if the mean from the sample taken from the population falls within one of these critical regions, we would conclude that there was too much of a difference between our sample mean and the hypothesized population mean and we would reject the null hypothesis. However, if the mean from the sample falls in the middle of the distribution (in between the critical regions) we would fail to reject the null hypothesis.
We calculate the critical region for the single-tail hypothesis test a bit differently. We would use a single-tail hypothesis test when the direction of the results is anticipated or we are only interested in one direction of the results. For example, a single-tail hypothesis test may be used when evaluating whether or not to adopt a new textbook. We would only decide to adopt the textbook if it improved student achievement relative to the old textbook. A single-tail hypothesis simply states that the mean is greater or less than the hypothesized value.
When performing a single-tail hypothesis test, our alternative hypothesis looks a bit different. When developing the alternative hypothesis in a single-tail hypothesis test we would use the symbols for greater than or less than. Using our example about SAT scores of graduating seniors, our null and alternative hypotheses could look something like:
$$H_0: \mu = 1100 \qquad H_a: \mu > 1100$$
In this scenario, our null hypothesis states that the mean SAT score would be equal to 1100, while the alternative hypothesis states that the mean SAT score would be greater than 1100. A single-tail hypothesis test also means that we have only one critical region, because we put the entire region of rejection into just one side of the distribution. When the alternative hypothesis is that the population mean is greater than the hypothesized value, the critical region is on the right side of the distribution. When the alternative hypothesis is that the population mean is smaller, the critical region is on the left side of the distribution.
To calculate the critical regions, we must first find the critical values, or the cut-offs where the critical regions start. These values are specified by the z-distribution and can be found in a table that lists the areas of each of the tails under a normal distribution. Using this table, we find that for a two-tailed test at the 0.05 significance level, our critical values would fall at 1.96 standard errors above and below the mean. For a 0.01 significance level, our critical values would fall at 2.576 standard errors above and below the mean. Using the z-distribution we can find critical values (as specified by standard scores) for any level of significance for either single- or two-tailed hypothesis tests.
Example: Determine the critical value for a single-tailed hypothesis test with a 0.05 significance level.
Using the z-distribution table, we find that a significance level of 0.05 corresponds to a critical value of 1.645. If the alternative hypothesis is that the mean is greater than a specified value, the critical value would be 1.645. Due to the symmetry of the normal distribution, if the alternative hypothesis is that the mean is less than a specified value, the critical value would be -1.645.
Technology Note: Finding critical values on the TI83/84 Calculator
You can also find this critical value using the TI83/84 calculator: [DIST] invNorm(.05,0,1) returns -1.64485. The syntax for this is invNorm (area to the left, mean, standard deviation).
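For readers without a TI83/84, the same lookup can be sketched in Python. This is a minimal illustration that assumes Python 3.8 or later, whose standard library provides `statistics.NormalDist`; the helper name `critical_values` and its `tail` argument are made up for this sketch and are not part of the lesson.

```python
from statistics import NormalDist

# Standard normal distribution (mean 0, standard deviation 1).
STD_NORMAL = NormalDist(mu=0.0, sigma=1.0)

def critical_values(alpha, tail="two"):
    """Return the z critical value(s) for a significance level alpha.

    tail: "two", "right" (H_a: mean is greater), or "left" (H_a: mean is less).
    """
    if tail == "two":
        # Split alpha evenly between the two tails.
        upper = STD_NORMAL.inv_cdf(1 - alpha / 2)
        return (-upper, upper)
    if tail == "right":
        return (STD_NORMAL.inv_cdf(1 - alpha),)
    if tail == "left":
        # Same call as invNorm(alpha, 0, 1) on the calculator.
        return (STD_NORMAL.inv_cdf(alpha),)
    raise ValueError("tail must be 'two', 'right', or 'left'")

print(critical_values(0.05, "two"))    # about (-1.96, 1.96)
print(critical_values(0.01, "two"))    # about (-2.576, 2.576)
print(critical_values(0.05, "right"))  # about (1.645,)
print(critical_values(0.05, "left"))   # about (-1.645,)
```

Here `inv_cdf` plays the role of invNorm: it takes the area to the left and returns the matching z value.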
Calculating the Test Statistic
Before evaluating our hypotheses by determining the critical region and calculating the test statistic, we need to confirm that the distribution is normal and determine the hypothesized mean of the distribution.
To evaluate the sample mean against the hypothesized population mean, we use the concept of z-scores to determine how different the two means are from each other. Based on the Central Limit Theorem, the distribution of $\bar{x}$ is approximately normal with mean $\mu$ and standard deviation $\sigma/\sqrt{n}$. As we learned in previous lessons, the z-score is calculated by using the formula:
$$z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$$
where:
$z$ = standardized score
$\bar{x}$ = sample mean
$\mu_0$ = the population mean under the null hypothesis
$\sigma$ = population standard deviation. If we do not have the population standard deviation and if $n \ge 30$, we can use the sample standard deviation, $s$. If $n < 30$ and we do not have the population standard deviation, we use a different distribution, which will be discussed in a future lesson.
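The z-score calculation can be sketched as a small Python function. The function name and the numbers in the usage line are hypothetical: the hypothesized mean of 64 inches echoes the height illustration earlier in this section, while the sample mean, standard deviation, and sample size are invented purely to exercise the formula.

```python
from math import sqrt

def z_statistic(sample_mean, mu0, sd, n):
    """Standardized distance between the sample mean and the hypothesized mean.

    sd is the population standard deviation, or the sample standard
    deviation s when the population value is unknown and n >= 30.
    """
    standard_error = sd / sqrt(n)  # standard deviation of the sample mean
    return (sample_mean - mu0) / standard_error

# Hypothetical data: 36 people with a mean height of 66 inches, sd = 3 inches.
z = z_statistic(sample_mean=66.0, mu0=64.0, sd=3.0, n=36)
print(round(z, 2))  # 4.0
```

Note how the sample size enters through the standard error: the same 2-inch difference would give a much smaller z for a sample of one person.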
Once we calculate the z-score, we can make a decision about whether to reject or to fail to reject the null hypothesis based on the critical values.
Following are the steps you must take when doing a hypothesis test:
- Determine the null and alternative hypotheses.
- Verify that necessary conditions are satisfied and summarize the data into a test statistic.
- Determine the $\alpha$ level.
- Determine the critical region(s).
- Make a decision (Reject or fail to reject the null hypothesis)
- Interpret the decision in the context of the problem.
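The steps above can be sketched as one Python function. This is an illustrative outline rather than a standard library routine: the name `z_test`, its parameters, and the numbers in the usage lines are assumptions made for the example, and checking the conditions and interpreting the decision in context are still left to the analyst.

```python
from math import sqrt
from statistics import NormalDist

def z_test(sample_mean, mu0, sd, n, alpha=0.05, tail="two"):
    """One-sample z-test. Returns (z, reject_null).

    tail: "two" for H_a: mu != mu0, "right" for mu > mu0, "left" for mu < mu0.
    """
    std_normal = NormalDist()
    # Summarize the data into a test statistic.
    z = (sample_mean - mu0) / (sd / sqrt(n))
    # Use the alpha level to locate the critical region, then decide.
    if tail == "two":
        cutoff = std_normal.inv_cdf(1 - alpha / 2)
        reject_null = abs(z) > cutoff
    elif tail == "right":
        reject_null = z > std_normal.inv_cdf(1 - alpha)
    elif tail == "left":
        reject_null = z < std_normal.inv_cdf(alpha)
    else:
        raise ValueError("tail must be 'two', 'right', or 'left'")
    return z, reject_null

# Hypothetical data: H0 says the mean is 100; a sample of 100 has mean 103, sd 15.
z, reject_null = z_test(103.0, 100.0, 15.0, 100, alpha=0.05, tail="two")
print(round(z, 2), reject_null)  # 2.0 True
```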
Example: College A has an average SAT score of 1500. From a random sample of 125 freshman psychology students we find the average SAT score to be 1450 with a standard deviation of 100. We want to know if these freshman psychology students are representative of the overall population. What are our hypotheses and the test statistic?
1. Let’s first develop our null and alternative hypotheses:
$$H_0: \mu = 1500 \qquad H_a: \mu \neq 1500$$
2. The test statistic is
$$z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} = \frac{1450 - 1500}{100/\sqrt{125}} \approx -5.59$$
3. Choose $\alpha = 0.05$.
4. This is a two-sided test. If we choose $\alpha = 0.05$, the critical values will be -1.96 and 1.96. (Use invNorm(.025, 0, 1) and the symmetry of the normal distribution to determine these critical values.) That is, we will reject the null hypothesis if the value of our test statistic is less than -1.96 or greater than 1.96.
5. The value of the test statistic is -5.59. This is less than -1.96 and so our decision is to reject $H_0$.
6. Based on this sample, we believe that the mean SAT score of the freshman psychology students is not equal to 1500.
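As a check on the arithmetic in this example, the numbers can be run through a few lines of Python (a sketch assuming Python 3.8 or later; the variable names are chosen for this illustration, and the sample standard deviation stands in for the population value because the sample is large).

```python
from math import sqrt
from statistics import NormalDist

mu0, sample_mean, s, n, alpha = 1500, 1450, 100, 125, 0.05

z = (sample_mean - mu0) / (s / sqrt(n))   # test statistic
lower = NormalDist().inv_cdf(alpha / 2)   # left critical value
upper = -lower                            # right critical value, by symmetry
reject_null = z < lower or z > upper

print(f"z = {z:.2f}")                                  # z = -5.59
print(f"critical values = {lower:.2f}, {upper:.2f}")   # critical values = -1.96, 1.96
print("reject H0" if reject_null else "fail to reject H0")  # reject H0
```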
Example: A farmer is trying out a planting technique that he hopes will increase the yield on his pea plants. Over the last 5 years the average number of pods on one of his pea plants was 145 pods with a standard deviation of 100 pods. This year, after trying his new planting technique, he takes a random sample of 144 of his plants and finds the average number of pods to be 147. He wonders whether or not this is a statistically significant increase. What are his hypotheses and the test statistic?
1. First, we develop our null and alternative hypotheses:
$$H_0: \mu = 145 \qquad H_a: \mu > 145$$
This alternative hypothesis uses > since he believes that there might be a gain in the number of pods.
2. Next, we calculate the test statistic for the sample of pea plants:
$$z = \frac{147 - 145}{100/\sqrt{144}} \approx 0.24$$
3. Choose $\alpha = 0.05$.
4. The critical value will be 1.645. (Use invNorm(.95, 0, 1) to determine this critical value.) We will reject the null hypothesis if the test statistic is greater than 1.645. The value of the test statistic is 0.24.
5. This is less than 1.645 and so our decision is to fail to reject $H_0$.
6. Based on our sample, we do not have sufficient evidence that the mean number of pods per plant is greater than 145.
Finding the P-Value of an Event
We can also evaluate a hypothesis by asking “what is the probability of obtaining a test statistic at least as extreme as the one we observed if the null hypothesis is true?” This is called the p-value.
Example: Let’s use the example about the pea farmer. As we mentioned, the farmer is wondering if the number of pea pods per plant has gone up with his new planting technique and finds that in a sample of 144 plants there is an average of 147 pods per plant (compared to a previous average of 145 pods, the null hypothesis). To determine the p-value we ask: what is P(z > 0.24)? That is, what is the probability of obtaining a z value greater than 0.24 if the null hypothesis is true? Using the calculator (normcdf(.24, 99999999, 0, 1)) we find this probability to be 0.405. This indicates that, under the null hypothesis, there is a 40.5% chance that a sample of 144 plants would average 147 or more pods per plant. Because this p-value is far larger than a significance level of 0.05, we again fail to reject the null hypothesis.
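The same probability can be computed without a calculator. The sketch below assumes Python 3.8 or later; the helper `p_value` and its `tail` argument are names invented for this illustration, and the "two" option simply doubles the one-tail area for a two-sided alternative.

```python
from math import sqrt
from statistics import NormalDist

def p_value(z, tail="right"):
    """Area under the standard normal curve at least as extreme as z."""
    std_normal = NormalDist()
    if tail == "right":
        return 1 - std_normal.cdf(z)
    if tail == "left":
        return std_normal.cdf(z)
    if tail == "two":
        return 2 * (1 - std_normal.cdf(abs(z)))
    raise ValueError("tail must be 'two', 'right', or 'left'")

# Pea farmer: H0 mean 145, sample mean 147, sd 100, n = 144, H_a: mean > 145.
z = (147 - 145) / (100 / sqrt(144))
p = p_value(z, "right")
print(round(z, 2), round(p, 3))  # 0.24 0.405
```

Comparing `p` with the chosen significance level gives the same decision as comparing `z` with the critical value.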
Type I and Type II Errors
When we decide to reject or not reject the null hypothesis, we have four possible scenarios:
- The null hypothesis is true and we reject it.
- The null hypothesis is true and we do not reject it.
- The null hypothesis is false and we do not reject it.
- The null hypothesis is false and we reject it.
Two of these four possible scenarios lead to correct decisions: not rejecting the null hypothesis when it is true and rejecting the null hypothesis when it is false.
Two of these four possible scenarios lead to errors: rejecting the null hypothesis when it is true and not rejecting the null hypothesis when it is false.
Which type of error is more serious depends on the specific research situation, but ideally both types of errors should be minimized during the analysis.
Decision Made | $H_0$ is true | $H_0$ is false |
---|---|---|
Do not reject $H_0$ | Good Decision | Error (Type II) |
Reject $H_0$ | Error (Type I) | Good Decision |
The general approach to hypothesis testing focuses on the Type I error: rejecting the null hypothesis when it is true. The level of significance, also known as the alpha level, is defined as the probability of making a Type I error when testing a null hypothesis. For example, at the 0.05 level, we know that if the null hypothesis is true, the decision to reject it will be made incorrectly about 5 percent of the time.
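One way to see what the alpha level means is to simulate many samples from a population for which the null hypothesis really is true and count how often a two-tailed z-test rejects it. The sketch below is only an illustration: the population (normal, mean 0, standard deviation 1), the sample size, the number of trials, and the seed are all arbitrary choices, and the function name is hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

def simulate_type_i_rate(alpha=0.05, n=30, trials=2000, seed=1):
    """Fraction of simulated samples that (wrongly) reject a true H0."""
    rng = random.Random(seed)
    cutoff = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed critical value
    mu0, sigma = 0.0, 1.0  # H0 is TRUE: samples really come from this population
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(mu0, sigma) for _ in range(n)]
        z = (fmean(sample) - mu0) / (sigma / sqrt(n))
        if abs(z) > cutoff:
            rejections += 1  # a Type I error
    return rejections / trials

rate = simulate_type_i_rate()
print(rate)  # should land near 0.05; the exact value depends on the seed
```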
Calculating the probability of making a Type II error is not as straightforward as calculating the probability of making a Type I error. The probability of making a Type II error can only be determined when values have been specified for the alternative hypothesis. The probability of making a Type II error is denoted by $\beta$.
Once the value for the alternative hypothesis has been specified, it is possible to determine the probability of making a correct decision, $1 - \beta$. This quantity, $1 - \beta$, is called the power of the test.
The goal in hypothesis testing is to minimize the potential of both Type I and Type II errors. However, there is a relationship between these two types of errors. As the level of significance or alpha level increases, the probability of making a Type II error decreases and vice versa.
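The trade-off between the two error types can be made concrete with a short calculation. The sketch below assumes a right-tailed z-test with a known standard deviation; the function name and every number in the loop (a null mean of 100, an alternative mean of 105, a standard deviation of 15, a sample of 36) are hypothetical values chosen only to show the pattern.

```python
from math import sqrt
from statistics import NormalDist

def type_ii_error(mu0, mu_a, sigma, n, alpha):
    """Probability of failing to reject H0: mu = mu0 when the true mean is mu_a.

    Assumes a right-tailed z-test (H_a: mu > mu0) with mu_a > mu0.
    """
    std_normal = NormalDist()
    se = sigma / sqrt(n)
    # Smallest sample mean that leads to rejecting H0.
    critical_mean = mu0 + std_normal.inv_cdf(1 - alpha) * se
    # Chance the sample mean falls below that cutoff when mu_a is the truth.
    return std_normal.cdf((critical_mean - mu_a) / se)

for alpha in (0.01, 0.05, 0.10, 0.20):
    beta = type_ii_error(mu0=100, mu_a=105, sigma=15, n=36, alpha=alpha)
    print(f"alpha = {alpha:.2f}  beta = {beta:.3f}")
```

With the sample size held fixed, each increase in the alpha level in this loop produces a smaller value of beta.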
On the Web
http://tinyurl.com/35zg7du This link leads you to a graphical explanation of the relationship between $\alpha$ and $\beta$.
Often we establish the alpha level based on the severity of the consequences of making a Type I error. If the consequences are not that serious, we could set an alpha level at 0.10 or 0.20. However, in a field like medical research we would set the alpha level very low (at 0.001, for example) if there was potential bodily harm to patients. We can also attempt to minimize the Type II errors by setting higher alpha levels in situations that do not have grave or costly consequences.
Calculating the Power of a Test
The power of a test is defined as the probability of rejecting the null hypothesis when it is false (that is, making the correct decision). Obviously, we want to maximize this power if we are concerned about making Type II errors. To determine the power of the test, there must be a specified value for the alternative hypothesis.
Example: Suppose that a doctor is concerned about making a Type II error only if the active ingredient in the new medication is more than 3 milligrams higher than what was specified in the null hypothesis (say, 250 milligrams, with a sample of 200 and a standard deviation of 50). Now we have values for both the null and the alternative hypotheses:
$$H_0: \mu = 250 \qquad H_a: \mu = 253$$
By specifying a value for the alternative hypothesis, we have selected one of the many values for $H_a$. In determining the power of the test, we must assume that $H_a$ is true and determine whether we would correctly reject the null hypothesis.
Calculating the exact value for the power of the test requires determining the area above the critical value set up to test the null hypothesis when it is re-centered around the alternative hypothesis. If we have an alpha level of 0.05, our critical value would be 1.645 for the one-tailed test. Therefore,
$$1.645 = \frac{\bar{x} - 250}{50/\sqrt{200}}$$
Solving for $\bar{x}$ we find:
$$\bar{x} = 250 + 1.645 \cdot \frac{50}{\sqrt{200}} \approx 255.8$$
Now, with the distribution re-centered at the population mean specified by the alternative hypothesis, $\mu = 253$, we want to find the z-score of this critical sample mean. Therefore, we can figure that:
$$z = \frac{255.8 - 253}{50/\sqrt{200}} \approx 0.79$$
Recall that we reject the null hypothesis if the sample mean falls above 255.8, which is to the right of $z = 0.79$ on the re-centered distribution. The question now is: what is the probability of rejecting the null hypothesis when, in fact, the alternative hypothesis is true? We need to find the area to the right of 0.79. You can find this area using a table or using the calculator with the normcdf command (normcdf(0.79, 9999999, 0, 1)). The probability is 0.2148. This means that, since we assumed the alternative hypothesis to be true, there is only a 21.5% chance of rejecting the null hypothesis. Thus, the power of the test is 0.2148. In other words, this test of the null hypothesis is not very powerful and has only a 0.2148 probability of detecting the real difference between the two hypothesized means.
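The power calculation for the doctor's example can be reproduced step by step in Python (a sketch assuming Python 3.8 or later; the variable names are chosen for this illustration). Because the code carries full precision instead of rounding the critical sample mean to 255.8 and the z-score to 0.79, it gives a power of about 0.213 rather than the table-based 0.2148.

```python
from math import sqrt
from statistics import NormalDist

mu0, mu_a, sigma, n, alpha = 250, 253, 50, 200, 0.05
std_normal = NormalDist()

se = sigma / sqrt(n)                      # standard error, about 3.536
z_crit = std_normal.inv_cdf(1 - alpha)    # one-tailed critical value, about 1.645
critical_mean = mu0 + z_crit * se         # reject H0 above this, about 255.8

# Re-center on the alternative hypothesis and find the area to the right.
z_under_ha = (critical_mean - mu_a) / se  # about 0.80
power = 1 - std_normal.cdf(z_under_ha)    # about 0.213
beta = 1 - power                          # probability of a Type II error

print(f"critical sample mean = {critical_mean:.1f}")
print(f"power = {power:.3f}, beta = {beta:.3f}")
```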
There are several things that affect the power of a test including:
- Whether the alternative hypothesis is a single-tailed or two-tailed test.
- The level of significance
- The sample size.
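The effect of each factor can be explored numerically. The sketch below reuses the numbers from the doctor's example (250 versus 253 milligrams, standard deviation 50) and varies one factor at a time; the function `power_z_test` is a hypothetical helper written for this illustration, and the alternative sample sizes and alpha levels are arbitrary.

```python
from math import sqrt
from statistics import NormalDist

def power_z_test(mu0, mu_a, sigma, n, alpha=0.05, tail="right"):
    """Power of a z-test of H0: mu = mu0 when the true mean is mu_a."""
    std_normal = NormalDist()
    # Distance between the two hypothesized means, in standard errors.
    shift = (mu_a - mu0) / (sigma / sqrt(n))
    if tail == "right":
        return 1 - std_normal.cdf(std_normal.inv_cdf(1 - alpha) - shift)
    if tail == "two":
        cutoff = std_normal.inv_cdf(1 - alpha / 2)
        # Reject when z > cutoff or z < -cutoff.
        return (1 - std_normal.cdf(cutoff - shift)) + std_normal.cdf(-cutoff - shift)
    raise ValueError("tail must be 'right' or 'two'")

for n in (200, 800, 3200):
    print("n =", n, "power =", round(power_z_test(250, 253, 50, n), 3))
for alpha in (0.01, 0.05, 0.10):
    print("alpha =", alpha, "power =", round(power_z_test(250, 253, 50, 200, alpha), 3))
for tail in ("right", "two"):
    print("tail =", tail, "power =", round(power_z_test(250, 253, 50, 200, tail=tail), 3))
```

In this setup, power rises with the sample size and with the alpha level, and the one-tailed test is more powerful than the two-tailed test at the same alpha level when the true mean lies in the anticipated direction.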
On the Web
http://intuitor.com/statistics/CurveApplet.html Experiment with changing the sample size and the distance between the null and alternate hypotheses and discover what happens to the power.
Lesson Summary
Hypothesis testing involves making a conjecture about a population based on a sample drawn from the population.
We establish critical regions based on level of significance or alpha level. If the value of the test statistic falls in these critical regions, we make the decision to reject the null hypothesis.
To evaluate the sample mean against the hypothesized population mean, we use the concept of z-scores to determine how different the two means are.
When we make a decision about a hypothesis, there are four possible outcomes and two different types of errors. A Type I error is when we reject the null hypothesis when it is true, and a Type II error is when we do not reject the null hypothesis even though it is false. $\alpha$, the level of significance of the test, is the probability of rejecting the null hypothesis when, in fact, the null hypothesis is true (an error).
The power of a test is defined as the probability of rejecting the null hypothesis when it is false (in other words, making the correct decision). We determine the power of a test by assigning a value to the alternative hypothesis and using the z-score to calculate the probability of rejecting the null hypothesis when it is false. It is equal to $1 - \beta$, where $\beta$ is the probability of making a Type II error.
Multimedia Links
For an illustration of the use of the p-value in statistics (4.0) and how to interpret it (18.0), see UCMSCI, Understanding the P-Value (4:04)
Review Questions
- If the difference between the hypothesized population mean and the mean of the sample is large, we ___ the null hypothesis. If the difference between the hypothesized population mean and the mean of the sample is small, we ___ the null hypothesis.
- At the Chrysler manufacturing plant, there is a part that is supposed to weigh precisely 19 pounds. The engineers take a sample of parts and want to know if they meet the weight specifications. What are our null and alternative hypotheses?
- In a hypothesis test, if the difference between the sample mean and the hypothesized mean divided by the standard error falls in the middle of the distribution and in between the critical values, we ___ the null hypothesis. If this number falls in the critical regions and beyond the critical values, we ___ the null hypothesis.
- Use the z-distribution table to determine the critical value for a single-tailed hypothesis test with a 0.01 significance level.
- Sacramento County high school seniors have an average SAT score of 1020. From a random sample of 144 Sacramento High School students we find the average SAT score to be 1100 with a standard deviation of 144. We want to know if these high school students are representative of the overall population. What are our hypotheses and the test statistic?
- During hypothesis testing, we use the p-value to predict the ___ of an event occurring if the null hypothesis is true.
- A survey shows that California teenagers have an average of $500 in savings (standard error 100). What is the probability that a randomly selected teenager will have savings greater than $520?
- Fill in the types of errors missing from the table below:
Decision Made | Null Hypothesis is True | Null Hypothesis is False |
---|---|---|
Reject Null Hypothesis | (1) ___ | Correct Decision |
Do not Reject Null Hypothesis | Correct Decision | (2) ___ |
- The ___ is defined as the probability of rejecting the null hypothesis when it is false (making the correct decision). We want to maximize ___ if we are concerned about making Type II errors.
- The Governor’s economic committee is investigating average salaries of recent college graduates in California. They decide to test the null hypothesis that the average salary is $24,500 (standard deviation is $4,800) and are concerned with making a Type II error only if the average salary is less than $25,000. For a given alpha level and a sample of 144, determine the power of a one-tailed test.