# 8.1: Hypothesis Testing and the P-Value

Created by: CK-12

## Learning Objectives

- Develop null and alternative hypotheses to test for a given situation.
- Understand the critical regions of a graph for single- and two-tailed hypothesis tests.
- Calculate a test statistic to evaluate a hypothesis.
- Test the probability of an event using the \begin{align*}P\end{align*}-value.
- Understand Type I and Type II errors.
- Calculate the power of a test.

## Introduction

In this chapter we will explore **hypothesis testing,** which involves making educated guesses about a population based on a sample drawn from the population. Most times, hypothesis testing involves making guesses about the difference between the hypothesized value of the mean of an overall population and that of the sample. This is often used in statistics to analyze the likelihood that a population has certain characteristics. For example, we can use hypothesis testing to analyze if a senior class has a particular average SAT score or if a prescription drug has a certain proportion of the active ingredient.

A hypothesis is simply an educated guess about a characteristic or set of facts. When performing statistical analyses, our hypotheses provide the general framework of what we are testing and how to perform the test. These tests are never certain and we can never *prove* or *disprove* hypotheses with statistics, but the outcomes of these tests provide information that either helps support or refute the hypothesis itself.

In this section we will learn about the different types of hypothesis testing, how to develop hypotheses, how to calculate statistics to help support or refute the hypotheses, and how to understand the errors associated with hypothesis testing.

## Developing Null and Alternative Hypotheses

As mentioned in the introduction, hypothesis testing involves testing the difference between a hypothesized value of the mean of an overall population and the mean calculated from a sample. In hypothesis testing, we are essentially determining the magnitude of the difference between the mean of the sample and the hypothesized mean of the population. If the difference is very large, we reject our hypothesis about the population. If the difference is very small, we do not. Below is an overview of this process.

In statistics, the hypothesis to be tested is called the **null hypothesis** and given the symbol \begin{align*}H_{0}\end{align*}. The null hypothesis states that there is no relationship or no difference between an accepted population mean and a sample mean. So finding a significant result means refuting the null hypothesis, showing that the true population mean is likely to be closer to the sample mean. We would calculate the mean of the sample and generalize these findings to the overall population. For example, if we were to test the hypothesis that the seniors had a mean SAT score of \begin{align*}1,100,\end{align*} our null hypothesis would be that the SAT score would be equal to \begin{align*}1,100\end{align*} or:

\begin{align*}H_{0}: \mu=1100\end{align*}

where:

\begin{align*}H_{0} = \end{align*} symbol for null hypothesis

\begin{align*}\mu =\end{align*} population mean

\begin{align*}1,100 =\end{align*} value to be tested

We test the null hypothesis against an **alternative hypothesis,** which is given the symbol \begin{align*}H_{a}\end{align*} and includes the outcomes not covered by the null hypothesis. Basically, the alternative hypothesis states that the true population mean differs from the hypothesized value. The alternative hypothesis can be supported only by rejecting the null hypothesis. In our example above about the SAT scores of graduating seniors, our alternative hypothesis would state that the mean SAT score differs from \begin{align*}1,100\end{align*} or:

\begin{align*}H_{a}: \mu\neq 1100\end{align*}

Let’s take a look at a couple of examples and develop a few null and alternative hypotheses.

**Example:**

We have a medicine that is being manufactured and each pill is supposed to have \begin{align*}14\;\mathrm{milligrams}\end{align*} of the active ingredient. What are our null and alternative hypotheses?

**Solution:**

\begin{align*}& H_{0}: \mu=14\\ & H_{a}: \mu\neq14 \end{align*}

Our null hypothesis states that the population has a mean equal to \begin{align*}14\;\mathrm{milligrams.}\end{align*} Our alternative hypothesis states that the population has a mean that is different than \begin{align*}14\;\mathrm{milligrams.}\end{align*}

**Example:**

The school principal wants to test if it is true what teachers say -- that high school juniors use the computer an average \begin{align*}3.2\;\mathrm{hours}\end{align*} a day. What are our null and alternative hypotheses?

**Solution:**

\begin{align*}& H_{0}: \mu=3.2 \\ & H_{a}: \mu\neq 3.2\end{align*}

Our null hypothesis states that the population has a mean equal to \begin{align*}3.2\;\mathrm{hours.}\end{align*} Our alternative hypothesis states that the population has a mean that differs from \begin{align*}3.2\;\mathrm{hours.}\end{align*}

## Deciding Whether to Reject the Null Hypothesis: Single and Two-Tailed Hypothesis Tests

When a hypothesis is tested, a statistician must decide on how much evidence is necessary in order to reject the null hypothesis. For example, if the null hypothesis is that the average height of a population is \begin{align*}64\;\mathrm{inches,}\end{align*} a statistician wouldn't measure one person who is \begin{align*}66\;\mathrm{inches}\end{align*} and reject the hypothesis based on that one trial. It is too likely that the discrepancy was merely due to chance. Statisticians first choose a **level of significance** or **alpha \begin{align*}(\alpha)\end{align*} level**, which is the probability threshold below which discrepancies from the null hypothesis are deemed significant. The most frequently used levels of significance are \begin{align*}0.05\end{align*} and \begin{align*}0.01.\end{align*} In other words, at these levels, if the null hypothesis is true, we will mistakenly reject it only \begin{align*}5\end{align*} or \begin{align*}1\;\mathrm{percent}\end{align*} of the time. The extreme regions of the distribution whose total area equals the level of significance are called the **critical regions**. When choosing the level of significance, we need to consider the consequences of rejecting or failing to reject the null hypothesis. If there is the potential for health consequences (as in the case of active ingredients in prescription medications) or great cost (as in the case of manufacturing machine parts), we should use a more ‘conservative’ critical region with levels of significance such as \begin{align*}.005\end{align*} or \begin{align*}.001.\end{align*}

When determining the critical regions for a **two-tailed** hypothesis test, the level of significance represents the extreme areas under the normal density curve. We call this a two-tailed hypothesis test because the critical region is located in both ends of the distribution. For example, if the significance level were \begin{align*}0.05,\end{align*} the critical region would be the most extreme \begin{align*}5\;\mathrm{percent}\end{align*} under the curve with \begin{align*}2.5\;\mathrm{percent}\end{align*} on each tail of the distribution.

Therefore, if the mean from a sample taken from the population falls within these critical regions, we would conclude that there was too much of a difference and we would reject the null hypothesis. However, if the mean from the sample falls in the middle of the distribution (in between the critical regions) we would fail to reject the null hypothesis.

We calculate the critical region for the single-tail hypothesis test a bit differently. We would use a single-tail hypothesis test when the direction of the results is anticipated or we are only interested in one direction of the results. For example, a single-tail hypothesis test may be used when evaluating whether or not to adopt a new textbook. We would only decide to adopt the textbook if it improved student achievement relative to the old textbook. A single-tail hypothesis simply states that the mean is greater or less than the hypothesized value.

When performing a single-tail hypothesis test, our alternative hypothesis looks a bit different. When developing the alternative hypothesis in a single-tail hypothesis test we would use the symbols of greater than or less than. Using our example about SAT scores of graduating seniors, our null and alternative hypothesis could look something like:

\begin{align*}& H_{0}: \mu=1100 \\ & H_{a}:\mu> 1100\end{align*}

In this scenario, our null hypothesis states that the mean SAT scores would be equal to \begin{align*}1,100\end{align*} while the alternate hypothesis states that the SAT scores would be greater than \begin{align*}1,100.\end{align*} A single-tail hypothesis test also means that we have only one critical region because we put the entire region of rejection into just one side of the distribution. When the alternative hypothesis is that the sample mean is greater, the critical region is on the right side of the distribution. When the alternative hypothesis is that the sample is smaller, the critical region is on the left side of the distribution (see below).

To calculate the critical regions, we must first find the **critical values** or the cut-offs where the critical regions start. To find these values, we use the critical values specified by the \begin{align*}z\end{align*}**-distribution**. These values can be found in a table that lists the areas of each of the tails under a normal distribution. Using this table, we find that for a two-tailed test at a \begin{align*}0.05\end{align*} significance level, our critical values would fall at \begin{align*}1.96\end{align*} standard errors above and below the mean. For a \begin{align*}0.01\end{align*} significance level, our critical values would fall at about \begin{align*}2.58\end{align*} standard errors above and below the mean. Using the \begin{align*}z\end{align*}-distribution we can find critical values (as specified by standard \begin{align*}z\end{align*} scores) for any level of significance for either single- or two-tailed hypothesis tests.

**Example:**

Use the \begin{align*}z\end{align*}-distribution table to determine the critical value for a single-tailed hypothesis test with a \begin{align*}0.05\end{align*} significance level.

**Solution:**

Using the \begin{align*}z\end{align*}-distribution table, we find that a significance level of \begin{align*}0.05\end{align*} corresponds with a critical value of \begin{align*}1.645.\end{align*}
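
The table lookup can also be checked numerically. The following is a minimal sketch, assuming Python 3.8 or later and its standard `statistics.NormalDist` class; the helper name `critical_value` is our own and is not part of the lesson.

```python
from statistics import NormalDist

def critical_value(alpha, two_tailed=False):
    """Return the positive z critical value for significance level alpha."""
    # A two-tailed test splits alpha evenly between the two tails.
    tail_area = alpha / 2 if two_tailed else alpha
    # inv_cdf returns the z-score that has the requested area to its left.
    return NormalDist().inv_cdf(1 - tail_area)

one_tailed_05 = critical_value(0.05)
two_tailed_05 = critical_value(0.05, two_tailed=True)
two_tailed_01 = critical_value(0.01, two_tailed=True)
print(round(one_tailed_05, 3), round(two_tailed_05, 3), round(two_tailed_01, 3))
```

The three values come out at about 1.645, 1.960, and 2.576, in line with what a printed \begin{align*}z\end{align*}-table gives after rounding.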

## Calculating the Test Statistic

Before evaluating our hypotheses by determining the critical region and calculating the test statistic, we need to first:

- Confirm that the distribution is normal.
- Determine the hypothesized mean \begin{align*}(\mu)\end{align*} of the distribution.
- If we don’t have the population variance, we will need to calculate the standard deviation of the sample so that we can calculate the **standard error of the mean** \begin{align*}(\sigma_X)\end{align*}.

Remember that since we have a random sample from the population, we do not expect the sample mean to be *exactly* equal to the hypothesized value of the population mean. Therefore, the question really is: “How different can the observed sample mean be from the hypothesized mean before rejecting the null hypothesis?” Or, in other words, “If the null hypothesis is true, is it likely that we will obtain such an observed sample mean?” We use our critical values taken from the \begin{align*}z\end{align*}-distribution to determine those cutoffs.

To evaluate the sample mean against the hypothesized population mean, we use the concept of \begin{align*}z\end{align*}-scores to determine how different the two means are from each other. As we learned in previous lessons, the \begin{align*}z\end{align*}-score is calculated by using the formula:

\begin{align*}z = \frac {(\bar {X} - \mu )}{\sigma_X}\end{align*}

where:

\begin{align*}z =\end{align*} standardized score

\begin{align*}\bar {X}=\end{align*} sample mean

\begin{align*}\mu =\end{align*} hypothesized population mean

\begin{align*}\sigma_X =\end{align*} standard error of the mean. It is found by dividing the standard deviation by the square root of the number of observations \begin{align*}\left ( \frac{\sigma} {\sqrt {n}}\right )\end{align*}. If we do not have the population variance, we use the standard deviation of the sample in place of \begin{align*}\sigma\end{align*}.

Once we calculate the \begin{align*}z\end{align*}-score, we can make a decision about whether to reject or to fail to reject the null hypothesis based on the critical values.

Let’s calculate the test statistic for several different scenarios.

**Example:**

College A has an average SAT score of \begin{align*}1,500.\end{align*} From a random sample of \begin{align*}125\end{align*} freshman psychology students we find the average SAT score to be \begin{align*}1,450\end{align*} with a standard deviation of \begin{align*}100.\end{align*} We want to know if these freshman psychology students are representative of the overall population. What are our hypotheses and the test statistic?

**Solution:**

Let’s first develop our null and alternative hypotheses:

\begin{align*}& H_{0}: \mu=1500 \\ & H_{a}: \mu\neq 1500\end{align*}

Our standard \begin{align*}z\end{align*}-score for the sample of freshman psychology students would be:

\begin{align*}z = \frac{\bar {X} - \mu} {\sigma_X} = \frac{1450 -1500} {100/\sqrt{125}} \approx -5.59\end{align*}
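
The same arithmetic can be sketched in Python. The function name `z_statistic` is hypothetical, and the \begin{align*}0.05\end{align*} two-tailed cutoff of \begin{align*}1.96\end{align*} is an assumption made for illustration, since the example does not state a significance level.

```python
from math import sqrt

def z_statistic(sample_mean, hypothesized_mean, std_dev, n):
    """Standardized distance between a sample mean and a hypothesized mean."""
    standard_error = std_dev / sqrt(n)  # sigma / sqrt(n)
    return (sample_mean - hypothesized_mean) / standard_error

# College A example: hypothesized mean 1500, sample of 125 with mean 1450, sd 100.
z = z_statistic(1450, 1500, 100, 125)

# Assumed for illustration: two-tailed test at the 0.05 level (cutoffs at +/-1.96).
reject_null = abs(z) > 1.96
print(round(z, 2), reject_null)
```

Under that assumed cutoff, a \begin{align*}z\end{align*}-score near \begin{align*}-5.59\end{align*} lies far inside the left critical region, so the null hypothesis would be rejected.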

**Example:**

A farmer is trying out a planting technique that he hopes will increase the yield on his pea plants. Over the last \begin{align*}5\;\mathrm{years,}\end{align*} the average number of pods on one of his pea plants was \begin{align*}145\;\mathrm{pods}\end{align*} with a standard deviation of \begin{align*}100\;\mathrm{pods.}\end{align*} This year, after trying his new planting technique, he takes a random sample of \begin{align*}144\end{align*} of his plants and finds the average number of pods to be \begin{align*}147.\end{align*} He wonders whether or not this is a statistically significant increase. What are his hypotheses and the test statistic?

**Solution:**

First, we develop our null and alternative hypotheses:

\begin{align*}H_{0}:\mu=145\end{align*}

\begin{align*}H_{a}: \mu>145\end{align*}

This alternative hypothesis is \begin{align*}>\end{align*} since we are only concerned with the pod *gain* which translates to above the mean.

Next, we calculate the standard \begin{align*}z\end{align*}-score for the sample of pea plants.

\begin{align*}z = \frac{\bar {X}-\mu} {\sigma_X} = \frac{147-145} {100/\sqrt {144}} = 0.24\end{align*}
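
As a rough sketch of how this \begin{align*}z\end{align*}-score would be compared with a single-tail critical value, the Python below assumes a \begin{align*}0.05\end{align*} significance level (the example does not specify one) and uses a hypothetical helper named `right_tailed_z_test`.

```python
from math import sqrt
from statistics import NormalDist

def right_tailed_z_test(sample_mean, hypothesized_mean, std_dev, n, alpha=0.05):
    """Return (z, critical value, reject?) for H_a: mu > hypothesized_mean."""
    z = (sample_mean - hypothesized_mean) / (std_dev / sqrt(n))
    # The whole rejection area alpha sits in the right tail.
    critical = NormalDist().inv_cdf(1 - alpha)
    return z, critical, z > critical

# Pea farmer example: 144 plants averaging 147 pods against a historical mean of 145.
z, critical, reject_null = right_tailed_z_test(147, 145, 100, 144)
print(round(z, 2), round(critical, 3), reject_null)
```

With these numbers the \begin{align*}z\end{align*}-score of \begin{align*}0.24\end{align*} falls well short of the cutoff near \begin{align*}1.645,\end{align*} so at the assumed level we would fail to reject the null hypothesis.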

In the following lessons, we will use these standard \begin{align*}z\end{align*}-scores and the critical regions to evaluate the null and the alternative hypotheses.

## Testing the P-Value of an Event

We can also evaluate a hypothesis by finding the probability, or \begin{align*}P\end{align*}-value, of an event occurring. When we assume that we have normal distributions, we can determine approximately where on the normal distribution the sample mean will fall. When we know where it falls, we can use the \begin{align*}z\end{align*}-score to determine the **probability** of obtaining a sample mean at least that far above or below the hypothesized mean.

Let’s use the example about the pea farmer. As we mentioned, the farmer is wondering if the number of pea pods per plant has gone up with his new planting technique and finds that in a sample of \begin{align*}144\;\mathrm{plants}\end{align*} there is an average of \begin{align*}147\;\mathrm{pods}\end{align*} per plant (compared to a previous average of \begin{align*}145\;\mathrm{pods}\end{align*}). But the farmer is really hoping for a more dramatic yield increase. If the true mean is still \begin{align*}145\;\mathrm{pods,}\end{align*} what is the probability that a sample of \begin{align*}144\;\mathrm{plants}\end{align*} has an average yield of over \begin{align*}155\;\mathrm{pea\ pods}\end{align*} per plant?

To find this probability, first find the \begin{align*}z\end{align*}-score for the hypothesized sample mean using the formula that we learned in the section above. Therefore, a \begin{align*}z\end{align*}-score for a sample of plants with \begin{align*}155\;\mathrm{pods}\end{align*} would be:

\begin{align*}z = \frac {\bar {X}-\mu}{\sigma_X} = \frac {155-145}{100 /\sqrt {144}} = 1.20\end{align*}

Using the \begin{align*}z\end{align*}-score distribution, we find that the area beyond a \begin{align*}z\end{align*}-score of \begin{align*}1.20\end{align*} is equal to \begin{align*}.1151.\end{align*} This means that, if the true mean is \begin{align*}145\;\mathrm{pods,}\end{align*} there is a \begin{align*}.1151,\end{align*} or \begin{align*}11.5 \%,\end{align*} chance that a sample of \begin{align*}144\;\mathrm{plants}\end{align*} will average over \begin{align*}155\;\mathrm{pods}\end{align*} per plant.
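
The area beyond a \begin{align*}z\end{align*}-score can also be computed instead of read from a table. This is a minimal sketch assuming Python 3.8 or later; `upper_tail_probability` is a name chosen here for illustration.

```python
from statistics import NormalDist

def upper_tail_probability(z):
    """Area under the standard normal curve to the right of z."""
    return 1 - NormalDist().cdf(z)

# z-score of 1.20 from the pea pod calculation above.
p = upper_tail_probability(1.20)
print(round(p, 4))
```

The result agrees with the table value of about \begin{align*}.1151.\end{align*} For a left-tail probability we would use `NormalDist().cdf(z)` directly.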

## Type I and Type II Errors

When we decide to reject or not reject the null hypothesis, we have four possible scenarios:

1. A true hypothesis is rejected.
2. A true hypothesis is not rejected.
3. A false hypothesis is not rejected.
4. A false hypothesis is rejected.

If a hypothesis is true and we do not reject it (Option 2) or if a false hypothesis is rejected (Option 4), we have made the correct decision. But if we reject a true hypothesis (Option 1) or a false hypothesis is not rejected (Option 3) we have made an error. Overall, one type of error is not necessarily more serious than the other. Which type is more serious depends on the specific research situation, but ideally both types of errors should be minimized during the analysis.

Decision Made | Null Hypothesis is True | Null Hypothesis is False |
---|---|---|
Reject Null Hypothesis | Type I Error | Correct Decision |
Do not Reject Null Hypothesis | Correct Decision | Type II Error |

The general approach to hypothesis testing focuses on the **Type I** error: rejecting the null hypothesis when it is true. The level of significance, also known as the alpha level, is defined as the probability of making a Type I error when testing a null hypothesis. For example, at the \begin{align*}0.05\end{align*} level, we know that a null hypothesis that is actually true will be rejected \begin{align*}5\;\mathrm{percent}\end{align*} of the time.
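
One way to see what the alpha level means is to simulate many samples from a population in which the null hypothesis really is true and count how often we reject it. The sketch below is only an illustration: the population mean of \begin{align*}64\end{align*} echoes the height example, while the standard deviation of \begin{align*}3,\end{align*} the sample size of \begin{align*}25,\end{align*} the number of trials, and the function name are all assumptions.

```python
import random
from math import sqrt

def type_i_error_rate(mu0, sigma, n, trials, critical=1.96, seed=0):
    """Fraction of samples, drawn with H0 true, that a two-tailed z-test rejects."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    standard_error = sigma / sqrt(n)
    rejections = 0
    for _ in range(trials):
        # Draw a sample from a normal population whose mean really is mu0.
        sample_mean = sum(rng.gauss(mu0, sigma) for _ in range(n)) / n
        z = (sample_mean - mu0) / standard_error
        if abs(z) > critical:
            rejections += 1  # a Type I error: H0 is true but was rejected
    return rejections / trials

rate = type_i_error_rate(mu0=64, sigma=3, n=25, trials=4000)
print(rate)
```

The printed rate should land close to \begin{align*}0.05,\end{align*} though not exactly on it, because the simulation itself is subject to sampling variation.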

Calculating the probability of making a **Type II** error \begin{align*}(\beta)\end{align*} is not as straightforward as calculating the probability of a Type I error. The probability of making a Type II error can only be determined when values have been specified for both the alternative hypothesis and the null hypothesis. Once the value for the alternative hypothesis has been specified, it is possible to determine the probability of making a correct decision \begin{align*}(1-\beta)\end{align*}. This quantity, \begin{align*}1-\beta\end{align*}, is called the **power of the test** and is discussed in the next section.

As mentioned, our goal is to minimize the potential of both Type I and Type II errors. However, there is a relationship between these two types of errors. As the level of significance or alpha level \begin{align*}(\alpha)\end{align*} increases, the probability of making a Type II error \begin{align*}(\beta)\end{align*} decreases and vice versa. While \begin{align*}\alpha\end{align*} is under our direct control, \begin{align*}\beta\end{align*} is not. We will look at this relationship a bit more in depth in the next section.

Often we establish the alpha level based on the severity of the consequences of making a Type I error. If the consequences are not that serious, we could set an alpha level at \begin{align*}0.10\end{align*} or \begin{align*}0.20.\end{align*} However, in a field like medical research we would set the alpha level very low (at \begin{align*}0.001\end{align*} for example) if there was potential bodily harm to patients. We can also attempt to minimize Type II errors by setting higher alpha levels in situations that do not have grave or costly consequences.

## Calculating the Power of a Test

The **power of a test** is defined as the probability of rejecting the null hypothesis when it is false (making the correct decision). Obviously, we want to maximize this power if we are concerned about making Type II errors. To determine the power of the test, there must be a specified value for the alternative hypothesis, which is specified in much the same way as we specify the value in the null hypothesis. For example, suppose that a doctor is concerned about making a Type II error only if the active ingredient in the new medication is \begin{align*}3\;\mathrm{milligrams}\end{align*} higher than what was specified in the null hypothesis (say, \begin{align*}250\;\mathrm{milligrams}\end{align*} with a sample of \begin{align*}200\end{align*} and a standard deviation of \begin{align*}50\end{align*}). Now we have values for both the null and the alternative hypotheses.

\begin{align*}& H_{0}: \mu=250 \\ & H_{a}: \mu=253\end{align*}

By specifying a value for the alternative hypothesis, we have selected one of the many values for \begin{align*}H_{a}\end{align*}. In determining the power of the test, we must assume that \begin{align*}H_a\end{align*} is true and determine whether we would correctly reject the null hypothesis. In other words, we want to determine the power of our test for detecting this difference. In this example, we may choose a certain dosage if there were medical repercussions above that level.

We want to find the area under the curve that is associated with making a Type II error; the remaining area is the power that the test has for detecting this difference. Calculating the exact value for the power of the test requires determining the area above the critical value set up to test the null hypothesis when it is re-centered around the alternative hypothesis. Say that we have an alpha level of \begin{align*}.05.\end{align*} We would then have a critical value of \begin{align*}1.64\end{align*} for the single-tailed test, which corresponds to a sample mean of:

\begin{align*}z & = \frac {\bar X - \mu}{\sigma_X} \\ 1.64 & = \frac {\bar {X}-250}{50/\sqrt {200}} \\ \bar {X} & = 1.64 \left (\frac{50} {\sqrt{200}}\right ) + 250 \approx 255.8\end{align*}

Now, with a new mean set at the alternative hypothesis \begin{align*}(H_a: \mu=253)\end{align*} we want to find the value of the critical score \begin{align*}(255.8)\end{align*} when centered around this score. Therefore, we can figure that:

\begin{align*}z = \frac {\bar X - \mu}{\sigma_X} = \frac {255.8-253}{3.54} \approx 0.79 \end{align*}

Using the standard \begin{align*}z\end{align*} distribution we find that the area to the right of a \begin{align*}z\end{align*}-score of \begin{align*}.79\end{align*} is \begin{align*}.2148.\end{align*} This means that since we assumed the alternative hypothesis to be true, there is only a \begin{align*}21.5 \%\end{align*} chance of rejecting the null hypothesis. The power of this test is about \begin{align*}0.215.\end{align*} In other words, this test of the null hypothesis is not very powerful and has only a \begin{align*}0.215\end{align*} probability of detecting the real difference between the means.
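
The two steps above (find the cutoff under the null hypothesis, then find the area beyond it under the alternative) can be sketched in Python. The function name `power_right_tailed` is hypothetical, and the code uses the unrounded critical value of about \begin{align*}1.645\end{align*} rather than the table value \begin{align*}1.64,\end{align*} so its result differs slightly from the hand calculation.

```python
from math import sqrt
from statistics import NormalDist

def power_right_tailed(mu0, mu_alt, std_dev, n, alpha=0.05):
    """Power of a right-tailed z-test of H0: mu = mu0 when the true mean is mu_alt."""
    standard_error = std_dev / sqrt(n)
    # Step 1: sample mean at which we start rejecting H0.
    critical_mean = mu0 + NormalDist().inv_cdf(1 - alpha) * standard_error
    # Step 2: area to the right of that cutoff when the curve is centered on mu_alt.
    return 1 - NormalDist(mu_alt, standard_error).cdf(critical_mean)

# Medication example: H0 mu = 250, Ha mu = 253, sd 50, sample of 200.
power = power_right_tailed(250, 253, 50, 200)
beta = 1 - power  # probability of a Type II error
print(round(power, 3), round(beta, 3))
```

The computed power is about \begin{align*}0.21,\end{align*} in line with the table-based value of roughly \begin{align*}0.215;\end{align*} the small gap comes only from rounding in the hand calculation.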

There are several things that affect the power of a test including:

- Whether the alternative hypothesis is a single-tailed or two-tailed test.
- The level of significance \begin{align*}(\alpha)\end{align*}.
- The sample size.
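
To get a feel for these three factors, the sketch below recomputes the power for the medication example while changing one factor at a time. The function `power` and the alternative settings tried (a sample of \begin{align*}800,\end{align*} an alpha of \begin{align*}0.10,\end{align*} a two-tailed test) are assumptions chosen for illustration, not values from the lesson.

```python
from math import sqrt
from statistics import NormalDist

def power(mu0, mu_alt, std_dev, n, alpha=0.05, two_tailed=False):
    """Power of a z-test of H0: mu = mu0 when the true mean is mu_alt (mu_alt > mu0)."""
    std_normal = NormalDist()
    # How many standard errors the alternative mean lies above the null mean.
    shift = (mu_alt - mu0) / (std_dev / sqrt(n))
    if two_tailed:
        z_crit = std_normal.inv_cdf(1 - alpha / 2)
        # Reject when z > z_crit or z < -z_crit; add both areas under the alternative.
        return (1 - std_normal.cdf(z_crit - shift)) + std_normal.cdf(-z_crit - shift)
    z_crit = std_normal.inv_cdf(1 - alpha)
    return 1 - std_normal.cdf(z_crit - shift)

baseline = power(250, 253, 50, 200)                      # one-tailed, alpha 0.05
larger_sample = power(250, 253, 50, 800)                 # four times the sample size
larger_alpha = power(250, 253, 50, 200, alpha=0.10)      # less strict alpha
two_tailed = power(250, 253, 50, 200, two_tailed=True)   # same alpha, two tails
print(round(baseline, 3), round(larger_sample, 3),
      round(larger_alpha, 3), round(two_tailed, 3))
```

In this sketch a larger sample or a larger alpha raises the power, while switching to a two-tailed test at the same alpha lowers it.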

## Lesson Summary

- Hypothesis testing involves making educated guesses about a population based on a sample drawn from the population. We generate null and alternative hypotheses based on the mean of the population to test these guesses.
- We establish critical regions based on level of significance or alpha \begin{align*}(\alpha)\end{align*} levels. If the value of the test statistic falls in these critical regions, we reject the null hypothesis.
- To evaluate the sample mean against the hypothesized population mean, we use the concept of \begin{align*}z\end{align*}-scores to determine how different the two means are.
- When we make a decision about a hypothesis, there are four possible outcomes and two different types of errors. A Type I error is when we reject the null hypothesis when it is true and a Type II error is when we do not reject the null hypothesis, even when it is false.
- The power of a test is defined as the probability of rejecting the null hypothesis when it is false (in other words, making the correct decision). We determine the power of a test by assigning a value to the alternative hypothesis and using the \begin{align*}z\end{align*}-score to calculate the probability of making a Type II error.

## Review Questions

- If the difference between the hypothesized population mean and the mean of the sample is large, we ___ the null hypothesis. If the difference between the hypothesized population mean and the mean of the sample is small, we ___ the null hypothesis.
- At the Chrysler manufacturing plant, there is a part that is supposed to weigh precisely \begin{align*}19\;\mathrm{pounds.}\end{align*} The engineers take a sample of parts and want to know if they meet the weight specifications. What are our null and alternative hypotheses?
- In a hypothesis test, if the difference between the sample mean and the hypothesized mean divided by the standard error falls in the middle of the distribution and in between the critical values, we ___ the null hypothesis. If this number falls in the critical regions and beyond the critical values, we ___ the null hypothesis.
- Use the \begin{align*}z\end{align*}-distribution table to determine the critical value for a single-tailed hypothesis test with a \begin{align*}0.01\end{align*} significance level.
- Sacramento County high school seniors have an average SAT score of \begin{align*}1,020.\end{align*} From a random sample of \begin{align*}144\end{align*} Sacramento High School students we find the average SAT score to be \begin{align*}1,100\end{align*} with a standard deviation of \begin{align*}144.\end{align*} We want to know if these high school students are representative of the overall population. What are our hypotheses and the test statistic?
- During hypothesis testing, we use the \begin{align*}P\end{align*}-value to predict the ___ of an event occurring.
- A survey shows that California teenagers have an average of \begin{align*}\$ 500\end{align*} in savings \begin{align*}(\mathrm{standard\ error} = 100)\end{align*}. What is the probability that a randomly selected teenager will have savings greater than \begin{align*}\$ 520\end{align*}?
- Please fill in the types of errors missing from the table below:

Decision Made | Null Hypothesis is True | Null Hypothesis is False |
---|---|---|
Reject Null Hypothesis | (1) ___ | Correct Decision |
Do not Reject Null Hypothesis | Correct Decision | (2) ___ |

- The ___ is defined as the probability of rejecting the null hypothesis when it is false (making the correct decision). We want to maximize ___ if we are concerned about making Type II errors.
- The Governor’s economic committee is investigating average salaries of recent college graduates in California. They decide to test the null hypothesis that the average salary is \begin{align*}\$ 24,500\end{align*} (standard deviation is \begin{align*}\$ 4,800\end{align*}) and are concerned with making a Type II error only if the average salary is *less* than \begin{align*}\$ 25,100\end{align*} \begin{align*}(H_{a}: \mu=\$25,100)\end{align*}. For \begin{align*}\alpha =.05\end{align*} and a sample of \begin{align*}144,\end{align*} determine the power of a one-tailed test.

## Review Answers

- Reject, Fail to Reject
- \begin{align*}H_{0}: \mu =19\end{align*}, \begin{align*}H_{a}: \mu\neq 19\end{align*}
- Fail to Reject, Reject
- \begin{align*}Z = 2.325\end{align*}
- \begin{align*}H_{0}: \mu=1020, H_{a}:\mu\neq 1020, Z = 6.67\end{align*}
- Probability
- Area beyond a \begin{align*}z\end{align*}-score of \begin{align*}0.20 = .4207.\end{align*} Therefore, there is a probability of \begin{align*}42.07 \%\end{align*} that a teenager will have savings greater than \begin{align*}\$ 520.\end{align*}
- Type I error, Type II error
- Power of the Test
- \begin{align*}0.44\end{align*}