8.1: Hypothesis Testing and the P-Value
Learning Objectives
- Develop null and alternative hypotheses to test for a given situation.
- Understand the critical regions of a graph for one- and two-tailed hypothesis tests.
- Calculate a test statistic to evaluate a hypothesis.
- Test the probability of an event using the \begin{align*}P\end{align*}-value.
- Understand type I and type II errors.
- Calculate the power of a test.
Introduction
In this chapter, we will explore hypothesis testing, which involves making conjectures about a population based on a sample drawn from the population. Hypothesis tests are often used in statistics to analyze the likelihood that a population has certain characteristics. For example, we can use hypothesis testing to analyze if a senior class has a particular average SAT score or if a prescription drug has a certain proportion of the active ingredient.
A hypothesis is simply a conjecture about a characteristic or set of facts. When performing statistical analyses, our hypotheses provide the general framework of what we are testing and how to perform the test.
These tests are never certain, and we can never prove or disprove hypotheses with statistics, but the outcomes of these tests provide information that either helps support or refute the hypothesis itself.
In this section, we will learn about different hypothesis tests, how to develop hypotheses, how to calculate statistics to help support or refute the hypotheses, and how to better understand the errors associated with hypothesis testing.
Developing Null and Alternative Hypotheses
Hypothesis testing involves testing the difference between a hypothesized value of a population parameter and the estimate of that parameter, which is calculated from a sample. If the parameter of interest is the mean of the population in hypothesis testing, we are essentially determining the magnitude of the difference between the mean of the sample and the hypothesized mean of the population. If the difference is very large, we reject our hypothesis about the population. If the difference is very small, we do not. Below is an overview of this process.
In statistics, the hypothesis to be tested is called the null hypothesis and is given the symbol \begin{align*}H_0\end{align*}. The alternative hypothesis is given the symbol \begin{align*}H_a\end{align*}.
The null hypothesis defines a specific value of the population parameter that is of interest. Therefore, the null hypothesis always includes the possibility of equality. Consider the following:
\begin{align*}H_0: \mu & = 3.2\\ H_a: \mu & \neq 3.2\end{align*}
In this situation, if our sample mean, \begin{align*}\bar{x}\end{align*}, is very different from 3.2, we would reject \begin{align*}H_0\end{align*}. That is, we would reject \begin{align*}H_0\end{align*} if \begin{align*}\bar{x}\end{align*} is much larger than 3.2 or much smaller than 3.2. This is called a two-tailed test. An \begin{align*}\bar{x}\end{align*} that is very unlikely if \begin{align*}H_0\end{align*} is true is considered to be good evidence that the claim \begin{align*}H_0\end{align*} is not true. Now consider \begin{align*}H_0: \mu \le 3.2\end{align*} and \begin{align*}H_a: \mu > 3.2\end{align*}. In this situation, we would reject \begin{align*}H_0\end{align*} only for very large values of \begin{align*}\bar{x}\end{align*}. This is called a one-tailed test. If, for this test, our data gives \begin{align*}\bar{x}=15\end{align*}, an \begin{align*}\bar{x}\end{align*} this far above 3.2 would be highly unlikely to occur by chance if \begin{align*}H_0\end{align*} were true, so we would probably reject the null hypothesis in favor of the alternative hypothesis.
Example: If we were to test the hypothesis that the seniors had a mean SAT score of 1100, our null hypothesis would be that the SAT score would be equal to 1100, or:
\begin{align*}H_0: \mu = 1100\end{align*}
We test the null hypothesis against an alternative hypothesis, which, as previously stated, is given the symbol \begin{align*}H_a\end{align*} and includes the outcomes not covered by the null hypothesis. Basically, the alternative hypothesis states that the population mean differs from the hypothesized value. The alternative hypothesis can be supported only by rejecting the null hypothesis. In our example above about the SAT scores of graduating seniors, our alternative hypothesis would state the opposite of the null hypothesis, or:
\begin{align*}H_a: \mu \neq 1100\end{align*}
Let’s take a look at examples and develop a few null and alternative hypotheses.
Example: We have a medicine that is being manufactured, and each pill is supposed to have 14 milligrams of the active ingredient. What are our null and alternative hypotheses?
Solution:
\begin{align*}H_0 : \mu &=14\\ H_a : \mu &\neq 14\end{align*}
Our null hypothesis states that the population has a mean equal to 14 milligrams. Our alternative hypothesis states that the population has a mean that differs from 14 milligrams. This is a two-tailed test.
Example: A school principal wants to test whether it is true, as teachers say, that high school juniors use the computer an average of 3.2 hours a day. What are our null and alternative hypotheses?
\begin{align*}H_0: \mu &= 3.2\\ H_a: \mu & \neq 3.2\end{align*}
Our null hypothesis states that the population has a mean equal to 3.2 hours. Our alternative hypothesis states that the population has a mean that differs from 3.2 hours. This is also a two-tailed test.
Deciding Whether to Reject the Null Hypothesis: One-Tailed and Two-Tailed Hypothesis Tests
When a hypothesis is tested, a statistician must decide on how much evidence is necessary in order to reject the null hypothesis. For example, if the null hypothesis is that the average height of a population is 64 inches, a statistician wouldn't measure one person who is 66 inches and reject the hypothesis based on this one trial. It is too likely that the discrepancy was merely due to chance.
We use statistical tests to determine if the sample data give good evidence against \begin{align*}H_0\end{align*}. The threshold for how strong the sample evidence must be before we reject \begin{align*}H_0\end{align*} is called the level of significance, and it is denoted by \begin{align*}\alpha\end{align*}. If we choose, for example, \begin{align*}\alpha=0.01\end{align*}, we are saying that we will reject \begin{align*}H_0\end{align*} only if data as extreme as ours would occur no more than 1% of the time when \begin{align*}H_0\end{align*} is true.
The most frequently used levels of significance are 0.05 and 0.01. If our data results in a statistic that falls within the region determined by the level of significance, then we reject \begin{align*}H_0\end{align*}. Therefore, the region is called the critical region. When choosing the level of significance, we need to consider the consequences of rejecting or failing to reject the null hypothesis. If there is the potential for health consequences (as in the case of active ingredients in prescription medications) or great cost (as in the case of manufacturing machine parts), we should use a more conservative critical region, with levels of significance such as 0.005 or 0.001.
When determining the critical regions for a two-tailed hypothesis test, the level of significance represents the extreme areas under the normal density curve. This is a two-tailed hypothesis test, because the critical region is located in both ends of the distribution. For example, if there was a significance level of 0.05, the critical region would be the most extreme 5 percent under the curve, with 2.5 percent on each tail of the distribution.
Therefore, if the mean from the sample taken from the population falls within one of these critical regions, we would conclude that there was too much of a difference between our sample mean and the hypothesized population mean, and we would reject the null hypothesis. However, if the mean from the sample falls in the middle of the distribution (in between the critical regions), we would fail to reject the null hypothesis.
We calculate the critical region for a single-tail hypothesis test a bit differently. We would use a single-tail hypothesis test when the direction of the results is anticipated or we are only interested in one direction of the results. For example, a single-tail hypothesis test may be used when evaluating whether or not to adopt a new textbook. We would only decide to adopt the textbook if it improved student achievement relative to the old textbook. A single-tail hypothesis simply states that the mean is greater or less than the hypothesized value.
When performing a single-tail hypothesis test, our alternative hypothesis looks a bit different. When developing the alternative hypothesis in a single-tail hypothesis test, we would use the symbols for greater than or less than. Using our example about SAT scores of graduating seniors, our null and alternative hypothesis would look something like:
\begin{align*}H_0: \mu &= 1100\\ H_a: \mu & > 1100\end{align*}
In this scenario, our null hypothesis states that the mean SAT score would be equal to 1100, while the alternative hypothesis states that the mean SAT score would be greater than 1100. A single-tail hypothesis test also means that we have only one critical region, because we put the entire region of rejection into just one side of the distribution. When the alternative hypothesis is that the population mean is greater than the hypothesized value, the critical region is on the right side of the distribution. When the alternative hypothesis is that the population mean is smaller than the hypothesized value, the critical region is on the left side of the distribution.
To calculate the critical regions, we must first find the cut-offs, or the critical values, where the critical regions start. These values are specified by the \begin{align*}z\end{align*}-distribution and can be found in a table that lists the areas of each of the tails under a normal distribution. Using this table, we find that for a two-tailed test at the 0.05 significance level, our critical values would fall at 1.96 standard errors above and below the mean. For a two-tailed test at the 0.01 significance level, our critical values would fall at approximately 2.58 standard errors above and below the mean. Using the \begin{align*}z\end{align*}-distribution, we can find critical values (as specified by standard \begin{align*}z\end{align*}-scores) for any level of significance for either single-tailed or two-tailed hypothesis tests.
Example: Determine the critical value for a single-tailed hypothesis test with a 0.05 significance level.
Using the \begin{align*}z\end{align*}-distribution table, we find that a significance level of 0.05 corresponds with a critical value of 1.645. If our alternative hypothesis is that the mean is greater than a specified value, the critical value would be 1.645. Due to the symmetry of the normal distribution, if the alternative hypothesis is that the mean is less than a specified value, the critical value would be \begin{align*}-1.645\end{align*}.
Technology Note: Finding Critical \begin{align*}z\end{align*}-Values on the TI-83/84 Calculator
You can also find this critical value using the TI-83/84 calculator as follows: Press [2ND][DISTR], choose 'invNorm(', enter 0.05, 0, and 1, separated by commas, and press [ENTER]. This returns \begin{align*}-1.64485\end{align*}. The syntax for the 'invNorm(' command is 'invNorm (area to the left, mean, standard deviation)'.
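The same critical values can be looked up in software. The sketch below is one possible approach, assuming Python 3.8 or later and its standard-library `statistics.NormalDist` class; the helper name `critical_values` and its `tails` argument are illustrative choices, not part of the lesson. The `inv_cdf` method plays the same role as 'invNorm(' on the calculator: it takes the area to the left and returns the corresponding \begin{align*}z\end{align*}-score.

```python
from statistics import NormalDist

# Standard normal distribution (mean 0, standard deviation 1).
std_normal = NormalDist(mu=0.0, sigma=1.0)


def critical_values(alpha, tails="two"):
    """Return the critical z-value(s) for a significance level alpha.

    tails: "two" for a two-tailed test,
           "right" when H_a says the mean is greater than a value,
           "left" when H_a says the mean is less than a value.
    (Hypothetical helper written for this illustration.)
    """
    if tails == "two":
        z = std_normal.inv_cdf(1 - alpha / 2)  # alpha/2 of the area in each tail
        return (-z, z)
    if tails == "right":
        return std_normal.inv_cdf(1 - alpha)   # all of alpha in the right tail
    if tails == "left":
        return std_normal.inv_cdf(alpha)       # all of alpha in the left tail
    raise ValueError("tails must be 'two', 'right', or 'left'")


print(critical_values(0.05, "left"))   # single-tailed, left side
print(critical_values(0.05, "right"))  # single-tailed, right side
print(critical_values(0.05, "two"))    # two-tailed
print(critical_values(0.01, "two"))    # two-tailed
```

The left-tailed call at the 0.05 level should agree with the calculator result of about \begin{align*}-1.645\end{align*}, and the two-tailed calls should give values close to 1.96 and 2.58.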
Calculating the Test Statistic
Before evaluating our hypotheses by determining the critical region and calculating the test statistic, we need to confirm that the distribution is normal and determine the hypothesized mean, \begin{align*}\mu\end{align*}, of the distribution.
To evaluate the sample mean against the hypothesized population mean, we use the concept of \begin{align*}z\end{align*}-scores to determine how different the two means are from each other. Based on the Central Limit Theorem, the distribution of \begin{align*}\overline{x}\end{align*} is approximately normal, with mean \begin{align*}\mu\end{align*} and standard deviation \begin{align*}\frac{\sigma}{\sqrt{n}}\end{align*}. As we learned in previous lessons, the \begin{align*}z\end{align*}-score is calculated by using the following formula:
\begin{align*}z=\frac{\bar{x}-\mu}{\frac{\sigma}{\sqrt{n}}}\end{align*}
where:
\begin{align*}z\end{align*} is the standardized score.
\begin{align*}\bar{x}\end{align*} is the sample mean.
\begin{align*}\mu\end{align*} is the population mean under the null hypothesis.
\begin{align*}\sigma\end{align*} is the population standard deviation.
\begin{align*}n\end{align*} is the sample size.
If we do not have the population standard deviation, and if \begin{align*}n \ge 30\end{align*}, we can use the sample standard deviation, \begin{align*}s\end{align*}. If \begin{align*}n < 30\end{align*} and we do not have the population standard deviation, we use a different distribution, which will be discussed in a future lesson.
Once we calculate the \begin{align*}z\end{align*}-score, we can make a decision about whether to reject or to fail to reject the null hypothesis based on the critical values.
The following are the steps you must take when doing a hypothesis test:
- Determine the null and alternative hypotheses.
- Verify that the necessary conditions are satisfied, and summarize the data into a test statistic.
- Determine the \begin{align*}\alpha\end{align*} level.
- Determine the critical region(s).
- Make a decision (reject or fail to reject the null hypothesis).
- Interpret the decision in the context of the problem.
Example: College A has an average SAT score of 1500. From a random sample of 125 freshman psychology students, we find the average SAT score to be 1450, with a standard deviation of 100. We want to know if these freshman psychology students are representative of the overall population. What are our hypotheses and test statistic?
1. Let’s first develop our null and alternative hypotheses:
\begin{align*}H_0: \mu &= 1500\\ H_a: \mu &\neq 1500\end{align*}
2. The test statistic is \begin{align*}z=\frac{\bar{x}-\mu}{\frac{\sigma}{\sqrt{n}}}=\frac{1450-1500}{\frac{100}{\sqrt{125}}} \approx -5.59\end{align*}.
3. Now we choose \begin{align*}\alpha=0.05\end{align*}.
4. This is a two-tailed test. If we choose \begin{align*}\alpha=0.05\end{align*}, the critical values will be \begin{align*}-1.96\end{align*} and 1.96. (Use 'invNorm(0.025,0,1)' and the symmetry of the normal distribution to determine these critical values.) That is, we will reject the null hypothesis if the value of our test statistic is less than \begin{align*}-1.96\end{align*} or greater than 1.96.
5. The value of the test statistic is \begin{align*}-5.59\end{align*}. This is less than \begin{align*}-1.96\end{align*}, so our decision is to reject \begin{align*}H_0\end{align*}.
6. Based on this sample, we believe that the mean is not equal to 1500.
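The steps of this example can be checked with a short program. The following is a minimal sketch, assuming Python 3.8 or later; the function name `two_tailed_z_test` and its parameter names are hypothetical, and the sample standard deviation of 100 is used in place of \begin{align*}\sigma\end{align*} because \begin{align*}n \ge 30\end{align*}.

```python
from math import sqrt
from statistics import NormalDist


def two_tailed_z_test(sample_mean, null_mean, std_dev, n, alpha=0.05):
    """Two-tailed z-test for a mean (illustrative helper, not a library API).

    Returns the test statistic, the positive critical value, and the decision.
    """
    standard_error = std_dev / sqrt(n)              # sigma / sqrt(n)
    z = (sample_mean - null_mean) / standard_error  # test statistic
    critical = NormalDist().inv_cdf(1 - alpha / 2)  # alpha/2 in each tail
    reject = abs(z) > critical                      # is z in either critical region?
    return z, critical, reject


# College A example: mu = 1500 under H_0, x-bar = 1450, s = 100, n = 125.
z, critical, reject = two_tailed_z_test(1450, 1500, 100, 125)
print(f"z = {z:.2f}, critical values = +/-{critical:.2f}, reject H0: {reject}")
```

The test statistic rounds to \begin{align*}-5.59\end{align*}, which lies beyond \begin{align*}-1.96\end{align*}, so the sketch reaches the same decision as the worked example: reject \begin{align*}H_0\end{align*}.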
Example: A farmer is trying out a planting technique that he hopes will increase the yield of his pea plants. Over the last 5 years, the average number of pods on one of his pea plants was 145 pods, with a standard deviation of 100 pods. This year, after trying his new planting technique, he takes a random sample of 144 of his plants and finds the average number of pods to be 147. He wonders whether or not this is a statistically significant increase. What are his hypotheses and test statistic?
1. First, we develop our null and alternative hypotheses:
\begin{align*}H_0: \mu &= 145\\ H_a: \mu &> 145\end{align*}
This alternative hypothesis uses the '>' symbol, since the farmer believes that there might be a gain in the number of pods.
2. Next, we calculate the test statistic for the sample of pea plants:
\begin{align*}z=\frac{\bar{x}-\mu}{\frac{\sigma}{\sqrt{n}}}=\frac{147-145}{\frac{100}{\sqrt{144}}} \approx 0.24\end{align*}
3. Now we choose \begin{align*}\alpha=0.05\end{align*}.
4. The critical value will be 1.645. (Use 'invNorm(0.95,0,1)' to determine this critical value.) We will reject the null hypothesis if the test statistic is greater than 1.645. The value of the test statistic is 0.24.
5. The test statistic is less than 1.645, so our decision is to fail to reject \begin{align*}H_0\end{align*}.
6. Based on our sample, we do not have sufficient evidence to conclude that the mean number of pods is greater than 145.
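A one-tailed version of the calculation differs only in where the critical region sits. The sketch below, again assuming Python 3.8 or later and using a hypothetical helper named `right_tailed_z_test`, places the entire \begin{align*}\alpha\end{align*} in the right tail and applies it to the farmer's numbers.

```python
from math import sqrt
from statistics import NormalDist


def right_tailed_z_test(sample_mean, null_mean, std_dev, n, alpha=0.05):
    """Right-tailed z-test for a mean, H_a: mu > null_mean (illustrative helper)."""
    standard_error = std_dev / sqrt(n)
    z = (sample_mean - null_mean) / standard_error
    critical = NormalDist().inv_cdf(1 - alpha)  # all of alpha in the right tail
    reject = z > critical
    return z, critical, reject


# Pea plant example: mu = 145 under H_0, x-bar = 147, sigma = 100, n = 144.
z, critical, reject = right_tailed_z_test(147, 145, 100, 144)
print(f"z = {z:.2f}, critical value = {critical:.3f}, reject H0: {reject}")
```

Here the test statistic of about 0.24 falls well short of the critical value of about 1.645, so the decision is to fail to reject \begin{align*}H_0\end{align*}.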
Finding the \begin{align*}P\end{align*}-Value of an Event
We can also evaluate a hypothesis by asking, “What is the probability of obtaining a test statistic at least as extreme as the one we observed if the null hypothesis is true?” This probability is called the \begin{align*}P\end{align*}-value.
Example: Let’s use the example of the pea farmer. As we mentioned, the farmer is wondering if the number of pea pods per plant has gone up with his new planting technique and finds that in a sample of 144 plants, there is an average of 147 pods per plant (compared to a previous average of 145 pods). To determine the \begin{align*}P\end{align*}-value, we ask, "What is \begin{align*}P(z>0.24)\end{align*}?" That is, what is the probability of obtaining a \begin{align*}z\end{align*}-score greater than 0.24 if the null hypothesis is true? Using the 'normalcdf(0.24,99999999,0,1)' command on a graphing calculator, we find this probability to be 0.41. This indicates that if the null hypothesis is true, there is a 41% chance of obtaining a sample mean of 147 pods or more. Since the \begin{align*}P\end{align*}-value is greater than \begin{align*}\alpha=0.05\end{align*}, we fail to reject the null hypothesis.
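The right-tail area used for this \begin{align*}P\end{align*}-value can also be computed in software. This is a small sketch assuming Python 3.8 or later; `right_tail_p_value` is a hypothetical name, and `NormalDist().cdf(z)` gives the area to the left of \begin{align*}z\end{align*}, so the area to the right is one minus that value.

```python
from statistics import NormalDist


def right_tail_p_value(z):
    """P-value for a right-tailed z-test: the area to the right of z
    under the standard normal curve (illustrative helper)."""
    return 1 - NormalDist().cdf(z)


alpha = 0.05
p_value = right_tail_p_value(0.24)  # test statistic from the pea plant example
print(f"P-value = {p_value:.2f}")

if p_value < alpha:
    print("Reject H0")
else:
    print("Fail to reject H0")
```

The result is about 0.41, which is larger than \begin{align*}\alpha=0.05\end{align*}, so the sketch arrives at the decision to fail to reject the null hypothesis.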
Type I and Type II Errors
When we decide to reject or not to reject the null hypothesis, we have four possible scenarios:
- The null hypothesis is true, and we reject it.
- The null hypothesis is true, and we do not reject it.
- The null hypothesis is false, and we do not reject it.
- The null hypothesis is false, and we reject it.
Two of these four possible scenarios lead to correct decisions: not rejecting the null hypothesis when it is true and rejecting the null hypothesis when it is false.
Two of these four possible scenarios lead to errors: rejecting the null hypothesis when it is true and not rejecting the null hypothesis when it is false.
Which type of error is more serious depends on the specific research situation, but ideally, both types of errors should be minimized during the analysis.
Decision | \begin{align*}H_0\end{align*} is true | \begin{align*}H_0\end{align*} is false |
---|---|---|
Do Not Reject \begin{align*}H_0\end{align*} | Good Decision | Error (type II) |
Reject \begin{align*}H_0\end{align*} | Error (type I) | Good Decision |
The general approach to hypothesis testing focuses on the type I error: rejecting the null hypothesis when it is true. The level of significance, also known as the alpha level, is defined as the probability of making a type I error when testing a null hypothesis. For example, at the 0.05 level, if the null hypothesis is true, we will incorrectly reject it 5 percent of the time.
\begin{align*}\alpha= P(\text{rejecting} \ H_0|H_0 \ \text{is true})=P(\text{making a type I error})\end{align*}
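One way to see what this definition means is to simulate many samples from a population in which the null hypothesis really is true and count how often a two-tailed test rejects it. The sketch below assumes Python 3.8 or later; the population values (mean 100, standard deviation 15), the sample size, the number of trials, and the seed are all made-up numbers chosen only for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def simulate_type_i_rate(null_mean=100.0, sigma=15.0, n=30,
                         alpha=0.05, trials=2000, seed=1):
    """Estimate how often a two-tailed z-test rejects a TRUE null hypothesis.

    All default values are hypothetical and exist only for this demonstration.
    """
    rng = random.Random(seed)                       # fixed seed for repeatability
    critical = NormalDist().inv_cdf(1 - alpha / 2)
    standard_error = sigma / sqrt(n)
    rejections = 0
    for _ in range(trials):
        # Draw a sample from a population where H_0 is true.
        sample = [rng.gauss(null_mean, sigma) for _ in range(n)]
        z = (fmean(sample) - null_mean) / standard_error
        if abs(z) > critical:                       # a type I error
            rejections += 1
    return rejections / trials


rate = simulate_type_i_rate()
print(f"Observed type I error rate: {rate:.3f}")
```

Because the simulation is random, the observed rate will not equal 0.05 exactly, but it should be close to the chosen \begin{align*}\alpha\end{align*}.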
Calculating the probability of making a type II error is not as straightforward as calculating the probability of making a type I error. The probability of making a type II error can only be determined when values have been specified for the alternative hypothesis. The probability of making a type II error is denoted by \begin{align*}\beta\end{align*}.
\begin{align*}\beta= P(\text{not rejecting} \ H_0|H_0 \ \text{is false})=P(\text{making a type II error})\end{align*}
Once the value for the alternative hypothesis has been specified, it is possible to determine the probability of making a correct decision, which is \begin{align*}1-\beta\end{align*}. This quantity, \begin{align*}1-\beta\end{align*}, is called the power of a test.
The goal in hypothesis testing is to minimize the potential of both type I and type II errors. However, there is a relationship between these two types of errors. As the level of significance, or alpha level, increases, the probability of making a type II error \begin{align*}(\beta)\end{align*} decreases, and vice versa.
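This trade-off can be made concrete with a small calculation. The sketch below assumes Python 3.8 or later and a right-tailed \begin{align*}z\end{align*}-test; it also assumes, purely for illustration, that the true mean lies 2 standard errors above the mean stated in the null hypothesis. The function name `type_ii_probability` is hypothetical.

```python
from statistics import NormalDist

std_normal = NormalDist()


def type_ii_probability(alpha, shift_in_standard_errors):
    """beta for a right-tailed z-test when the true mean lies the given
    number of standard errors above the null-hypothesis mean
    (illustrative helper; the shift is an assumed value)."""
    critical = std_normal.inv_cdf(1 - alpha)
    # We fail to reject when z < critical. If the alternative is true,
    # z is normal with mean `shift_in_standard_errors` and standard deviation 1.
    return std_normal.cdf(critical - shift_in_standard_errors)


for alpha in (0.01, 0.05, 0.10):
    beta = type_ii_probability(alpha, shift_in_standard_errors=2.0)
    print(f"alpha = {alpha:.2f}  beta = {beta:.3f}")
```

Under these assumptions, each increase in \begin{align*}\alpha\end{align*} produces a smaller \begin{align*}\beta\end{align*}, which is the relationship described above.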
On the Web
http://tinyurl.com/35zg7du This link leads you to a graphical explanation of the relationship between \begin{align*}\alpha\end{align*} and \begin{align*}\beta\end{align*}.
Often we establish the alpha level based on the severity of the consequences of making a type I error. If the consequences are not that serious, we could set an alpha level at 0.10 or 0.20. However, in a field like medical research, we would set the alpha level very low (at 0.001, for example) if there was potential bodily harm to patients. We can also attempt to minimize type II errors by setting higher alpha levels in situations that do not have grave or costly consequences.
Calculating the Power of a Test
The power of a test is defined as the probability of rejecting the null hypothesis when it is false (that is, making the correct decision). Obviously, we want to maximize this power if we are concerned about making type II errors. To determine the power of the test, there must be a specified value for the alternative hypothesis.
Example: Suppose that a doctor is concerned about making a type II error only if the active ingredient in a new medication is at least 3 milligrams higher than what was specified in the null hypothesis (say, 250 milligrams, with a sample of 200 and a standard deviation of 50). In this case, we have values for both the null and the alternative hypotheses:
\begin{align*}H_0: \mu &= 250\\ H_a: \mu &= 253\end{align*}
By specifying a value for the alternative hypothesis, we have selected one of the many possible values for \begin{align*}H_a\end{align*}. In determining the power of the test, we must assume that \begin{align*}H_a\end{align*} is true and determine whether we would correctly reject the null hypothesis.
Calculating the exact value for the power of the test requires determining the area above the critical value set up to test the null hypothesis when it is re-centered around the alternative hypothesis. If we have an alpha level of 0.05, our critical value would be 1.645 for a one-tailed test. Therefore, we can plug our numbers into the \begin{align*}z\end{align*}-score formula as follows:
\begin{align*}z &= \frac{\bar{x}-\mu}{\frac{\sigma}{\sqrt{n}}}\\ 1.645 &= \frac{\bar{x}-250}{\frac{50}{\sqrt{200}}}\end{align*}
Solving for \begin{align*}\bar{x}\end{align*}, we find: \begin{align*}\bar{x}=(1.645)(\frac{50}{\sqrt{200}})+ 250 \approx 255.8\end{align*}.
Now, with the mean set at the alternative hypothesis, \begin{align*}H_a:\mu=253\end{align*}, we want to find the \begin{align*}z\end{align*}-score of the critical sample mean, \begin{align*}\bar{x} \approx 255.8\end{align*}, when the sampling distribution is centered on the population mean of the alternative hypothesis, \begin{align*}\mu=253\end{align*}. This can be done as follows:
\begin{align*}z=\frac{\bar{x}-\mu}{\frac{\sigma}{\sqrt{n}}}=\frac{255.8-253}{\frac{50}{\sqrt{200}}} \approx 0.79\end{align*}
Recall that we reject the null hypothesis if the sample mean falls to the right of the critical value, which corresponds to a \begin{align*}z\end{align*}-score of 0.79 when measured from the alternative mean. The question now is, "What is the probability of rejecting the null hypothesis when, in fact, the alternative hypothesis is true?" We need to find the area to the right of 0.79. You can find this area by using a \begin{align*}z\end{align*}-table or by using the command 'normalcdf(0.79,9999999,0,1)' on a graphing calculator. It turns out that the probability is 0.2148. This means that when the alternative hypothesis is true, there is only about a 21.5% chance of rejecting the null hypothesis. Thus, the power of the test is 0.2148. In other words, this test of the null hypothesis is not very powerful and has only a probability of 0.2148 of detecting the real difference between the two hypothesized means.
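The power calculation above can be reproduced step by step in code. The following sketch assumes Python 3.8 or later and a right-tailed test; the function name `power_right_tailed` and its parameters are hypothetical. It follows the same three steps: find the critical sample mean under \begin{align*}H_0\end{align*}, convert it to a \begin{align*}z\end{align*}-score relative to the alternative mean, and take the area to the right.

```python
from math import sqrt
from statistics import NormalDist


def power_right_tailed(null_mean, alt_mean, sigma, n, alpha=0.05):
    """Power of a right-tailed z-test against one specific alternative mean
    (illustrative helper, not a library function)."""
    std_normal = NormalDist()
    standard_error = sigma / sqrt(n)
    # Step 1: critical sample mean under the null hypothesis.
    critical_mean = null_mean + std_normal.inv_cdf(1 - alpha) * standard_error
    # Step 2: z-score of that cut-off when centered on the alternative mean.
    z_under_alt = (critical_mean - alt_mean) / standard_error
    # Step 3: probability of landing beyond the cut-off if H_a is true.
    return 1 - std_normal.cdf(z_under_alt)


# Medication example: H_0: mu = 250, H_a: mu = 253, sigma = 50, n = 200.
power = power_right_tailed(250, 253, 50, 200)
print(f"power = {power:.3f}")
```

Because the code carries full precision instead of rounding the intermediate values to 255.8 and 0.79, its result (about 0.213) differs slightly from the 0.2148 obtained by hand, but the conclusion is the same: the test has low power.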
There are several things that affect the power of a test, including:
- Whether the alternative hypothesis is a one-tailed or two-tailed test
- The level of significance, \begin{align*}\alpha\end{align*}
- The sample size
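To get a feel for two of these factors, the sketch below recomputes the power of a right-tailed test for several sample sizes and significance levels. It assumes Python 3.8 or later and reuses the numbers from the medication example (\begin{align*}H_0: \mu=250\end{align*}, \begin{align*}H_a: \mu=253\end{align*}, standard deviation 50); the function name and the particular sample sizes and alpha levels are illustrative choices.

```python
from math import sqrt
from statistics import NormalDist


def power_right_tailed(null_mean, alt_mean, sigma, n, alpha):
    """Power of a right-tailed z-test against a specific alternative mean
    (illustrative helper)."""
    std_normal = NormalDist()
    # Distance between the two hypothesized means, in standard errors.
    shift = (alt_mean - null_mean) / (sigma / sqrt(n))
    return 1 - std_normal.cdf(std_normal.inv_cdf(1 - alpha) - shift)


sample_sizes = (200, 400, 800, 1600)
alphas = (0.01, 0.05, 0.10)

for alpha in alphas:
    powers = [power_right_tailed(250, 253, 50, n, alpha) for n in sample_sizes]
    cells = ", ".join(f"n={n}: {p:.2f}" for n, p in zip(sample_sizes, powers))
    print(f"alpha = {alpha:.2f} -> {cells}")
```

Reading across a row shows the power rising as the sample size grows, and reading down a column shows it rising as \begin{align*}\alpha\end{align*} increases.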
On the Web
http://intuitor.com/statistics/CurveApplet.html Experiment with changing the sample size and the distance between the null and alternate hypotheses and discovering what happens to the power.
Lesson Summary
Hypothesis testing involves making a conjecture about a population based on a sample drawn from the population.
We establish critical regions based on level of significance, or \begin{align*}\alpha\end{align*} level. If the value of the test statistic falls in one of these critical regions, we make the decision to reject the null hypothesis.
To evaluate the sample mean against the hypothesized population mean, we use the concept of \begin{align*}z\end{align*}-scores to determine how different the two means are.
When we make a decision about a hypothesis, there are four different possible outcomes and two different types of errors. A type I error is when we reject the null hypothesis when it is true, and a type II error is when we do not reject the null hypothesis, even when it is false. The level of significance of the test, \begin{align*}\alpha\end{align*}, is the probability of rejecting the null hypothesis when, in fact, it is true (an error).
The power of a test is defined as the probability of rejecting the null hypothesis when it is false (in other words, making the correct decision). We determine the power of a test by assigning a value to the alternative hypothesis and using the \begin{align*}z\end{align*}-score to calculate the probability of rejecting the null hypothesis when it is false. It is the probability of making a type II error subtracted from 1.
Multimedia Links
For an illustration of the use of the \begin{align*}P\end{align*}-value in statistics (4.0), see UCMSCI, Understanding the P-Value (4:04).
For an explanation of what \begin{align*}P\end{align*}-value is and how to interpret it (18.0), see UCMSCI, Understanding the P-Value (4:04).
Review Questions
- If the difference between the hypothesized population mean and the mean of a sample is large, we ___ the null hypothesis. If the difference between the hypothesized population mean and the mean of a sample is small, we ___ the null hypothesis.
- At the Chrysler manufacturing plant, there is a part that is supposed to weigh precisely 19 pounds. The engineers take a sample of the parts and want to know if they meet the weight specifications. What are our null and alternative hypotheses?
- In a hypothesis test, if the difference between the sample mean and the hypothesized mean divided by the standard error falls in the middle of the distribution and in-between the critical values, we ___ the null hypothesis. If this number falls in the critical regions and beyond the critical values, we ___ the null hypothesis.
- Use a \begin{align*}z\end{align*}-distribution table to determine the critical value for a single-tailed hypothesis test with a 0.01 significance level.
- Sacramento County high school seniors have an average SAT score of 1020. From a random sample of 144 Sacramento high school students, we find the average SAT score to be 1100 with a standard deviation of 144. We want to know if these high school students are representative of the overall population. What are our hypotheses and test statistic?
- During hypothesis testing, we use the \begin{align*}P\end{align*}-value to predict the ___ of an event occurring.
- A survey shows that California teenagers have an average of $500 in savings (standard deviation \begin{align*}=\end{align*} $100). What is the probability that a randomly selected teenager will have savings greater than $520?
- Fill in the types of errors missing from the table below:
Decision Made | Null Hypothesis is True | Null Hypothesis is False |
---|---|---|
Reject Null Hypothesis | (1) ___ | Correct Decision |
Do not Reject Null Hypothesis | Correct Decision | (2) ___ |
- The ___ is defined as the probability of rejecting the null hypothesis when it is false (making the correct decision). We want to maximize ___ if we are concerned about making type II errors.
- The Governor’s economic committee is investigating average salaries of recent college graduates in California. It decides to test the null hypothesis that the average salary is $24,500 (with a standard deviation of $4,800) and is concerned with making a type II error only if the average salary is less than $25,000. In this case, \begin{align*}H_a: \mu=\$ 25,100\end{align*}. For \begin{align*}\alpha=0.05\end{align*} and a sample of 144, determine the power of a one-tailed test.