
# Chi-Square Test

## Closeness of observed data to expected data of the model


#### Objective

Here you will learn how to use a Chi-Squared statistic to evaluate the fit of a hypothesized distribution. This is known as a Goodness of Fit test.

#### Concept

Suppose you wanted to evaluate a recent statistic stating that iOS represents 32% and Android 51% of active smart phones. You would like to know if the statistic actually reflects the distribution of phones among your friends. How could you evaluate the data you collect to see if it supports this hypothesis?

Look to the end of the lesson for the answer.

#### Watch This

http://youtu.be/b3o_hjWKgQw statslectures – Chi-Square Test for Goodness of Fit

#### Guidance

The Greek letter “chi”, written as $\chi$, is the symbol used to identify a chi-square statistic, which we will use here to evaluate how well a set of observed data fits a corresponding expected set.

Conducting a Chi-Square test is much like conducting a Z-test or T-test as we did in Chapter 10. We will follow the same basic series of steps and compare a calculated value to a chart to evaluate the probability of getting the results we have if the null hypothesis is true, just as we did with the Z and T tests. Additionally, as was the case with T-testing, we will be evaluating the number of degrees of freedom, and choosing values from a chart based on that number.

The primary difference between a Chi-Square test and the tests we have worked with before is that previous tests have all been primarily dedicated to comparing single parameters, whereas Chi-Square tests compare counts across several categories at once. They are used either to evaluate how well observed data fit a hypothesized distribution (the goodness of fit test covered here) or to determine if two random variables are independent or related. Additionally, the Chi-Square statistic is useful for looking at categorical data rather than quantitative data.

The Chi-Square statistic is actually pretty straightforward to calculate:

$\chi^2=\sum \frac{(observed - expected)^2}{expected}$
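As a minimal sketch of this formula in Python, the sum can be computed one category at a time. The function name `chi_square_statistic` and the toy counts below are illustrative assumptions and do not come from the lesson.

```python
def chi_square_statistic(observed, expected):
    """Sum (observed - expected)^2 / expected over all categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical toy counts, chosen only to illustrate the arithmetic.
toy_observed = [10, 20]
toy_expected = [15, 15]

# Each category contributes (5)^2 / 15, so the total is 50 / 15.
toy_statistic = chi_square_statistic(toy_observed, toy_expected)
print(round(toy_statistic, 4))
```

Note that every expected count must be nonzero, since each term divides by it.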

Example A

The American Pet Products Association conducted a survey in 2011 and determined that 60% of dog owners have only one dog, 28% have two dogs, and 12% have three or more. Supposing that you have decided to conduct your own survey and have collected the data below, determine whether your data supports the results of the APPA study. Use a significance level of 0.05.

Data: Out of 129 dog owners, 73 had one dog, 38 had two dogs, and the remaining 18 had three or more dogs.

Solution:

• Step 1: Clearly state the null and alternative hypotheses

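$H_0$: The distribution of dogs per owner in our sample agrees with the APPA survey (60% one dog, 28% two dogs, 12% three or more).

$H_1$: The distribution of dogs per owner in our sample does not agree with the APPA survey.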

• Step 2: Identify an appropriate test and significance level

Since we are comparing a whole set of observed counts to a set of expected counts, and not just a single value, a Chi-Square test is appropriate. The problem specifies a significance level of 0.05.

• Step 3: Analyze sample data

Create a table to organize data and compare the observed data to the expected data:

| | One Dog | Two Dogs | 3+ Dogs | TOTAL |
|---|---|---|---|---|
| Observed | 73 | 38 | 18 | 129 |
| Expected | | | | |

To identify the expected values, multiply the expected % by the total number observed:

| | One Dog | Two Dogs | 3+ Dogs | TOTAL |
|---|---|---|---|---|
| Observed | $73$ | $38$ | $18$ | $129$ |
| Expected | $0.60 \times 129=77.4$ | $0.28 \times 129 \approx 36.1$ | $0.12 \times 129 \approx 15.5$ | $129$ |
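The expected counts can be reproduced with a short Python sketch. The helper name `expected_counts` is a hypothetical choice for illustration; the proportions and counts are the ones given in Example A.

```python
def expected_counts(proportions, total):
    """Expected count per category = hypothesized proportion * sample size."""
    return [p * total for p in proportions]


appa_proportions = [0.60, 0.28, 0.12]  # one dog, two dogs, three or more
observed = [73, 38, 18]
total = sum(observed)  # 129 dog owners in the sample

expected = expected_counts(appa_proportions, total)
print([round(e, 2) for e in expected])
```

Before rounding, the second and third expected counts are 36.12 and 15.48. The table rounds them to one decimal place, which slightly changes the later digits of the chi-square statistic.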

To calculate our chi-square statistic, we need to sum the squared difference between each observed and expected value divided by the expected value:

$$\begin{aligned}
\chi^2 &=\sum \frac{(observed - expected)^2}{expected} \\
\chi^2 &=\frac{(73 - 77.4)^2}{77.4} + \frac{(38 - 36.1)^2}{36.1} + \frac{(18 - 15.5)^2}{15.5} \\
\chi^2 &=\frac{(-4.4)^2}{77.4} + \frac{(1.9)^2}{36.1} + \frac{(2.5)^2}{15.5} \\
\chi^2 &=\frac{19.36}{77.4} + \frac{3.61}{36.1} + \frac{6.25}{15.5} \\
\chi^2 &=0.2501 + 0.1000 + 0.4032 \\
\chi^2 &=0.7533
\end{aligned}$$

Now that we have our chi-square statistic, we need to compare it to the chi-square critical value for the significance level 0.05. We can use a reference table or a chi-square value calculator. Just as with the T-tests in Chapter 10, we will need to know the degrees of freedom, which equal the number of categories minus one. In this case, there are three categories: one dog, two dogs, and three or more dogs. The degrees of freedom, therefore, are $3 - 1 = 2$.

Using the calculator or the table, we find that the critical value for a 0.05 significance level with $df = 2$ is 5.9915. That means that, if the null hypothesis is true, about 95 samples out of 100 will produce a $\chi^2$ statistic of 5.9915 or less. If our chi-square statistic were greater than 5.9915, then either we observed a result that occurs 5 or fewer times out of 100, or the null hypothesis is incorrect. Our chi-square statistic is only 0.7533, so we will not reject the null hypothesis.

• Step 4: Interpret the results

Since our chi-square statistic was less than the critical value, we do not reject the null hypothesis, and we can say that our survey data is consistent with the results from the APPA.
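The comparison in Example A can be sketched in Python using only the standard library. One assumption to flag: the closed form used for the critical value, $-2\ln(\alpha)$, holds only for two degrees of freedom, where the upper-tail probability of the chi-square distribution is $e^{-x/2}$. For other degrees of freedom, a table or calculator is still needed.

```python
import math

observed = [73, 38, 18]
expected = [77.4, 36.1, 15.5]  # rounded expected counts from the table

# Chi-square statistic: sum of (observed - expected)^2 / expected.
statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

alpha = 0.05
df = len(observed) - 1  # 3 categories - 1 = 2

# Assumption: this shortcut is valid only when df == 2.
critical_value = -2 * math.log(alpha)

reject_null = statistic > critical_value
print(round(critical_value, 4), reject_null)
```

The statistic falls far below the critical value, so the sketch reaches the same decision as the text: do not reject the null hypothesis.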

Example B

Rachel told Eric that the reason her car insurance is less expensive is that female drivers get in fewer accidents than male drivers. Specifically, she says that male drivers are held responsible in 65% of accidents involving drivers under 23.

If Eric does some research of his own and discovers that 46 out of the 85 accidents he investigates involve male drivers, does his data support Rachel’s hypothesis?

Solution:

• Step 1: Clearly state the null and alternative hypotheses

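$H_0$: The distribution of accidents in Eric's sample agrees with Rachel's claim (65% male drivers, 35% female drivers).

$H_1$: The distribution of accidents in Eric's sample does not agree with Rachel's claim.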

• Step 2: Identify an appropriate test and significance level

Since we are comparing two sets of data, and not just a single value, a Chi-Square test is appropriate. In the absence of a stated significance level in the problem, we assume the default 0.05.

• Step 3: Analyze sample data

Create a table to organize data and compare the observed data to the expected data:

| | Male Drivers | Female Drivers | TOTAL |
|---|---|---|---|
| Observed | 46 | 39 | 85 |
| Expected | | | |

To identify the expected values, multiply the expected % by the total number observed:

| | Male Drivers | Female Drivers | TOTAL |
|---|---|---|---|
| Observed | $46$ | $39$ | $85$ |
| Expected | $0.65 \times 85=55.25$ | $0.35 \times 85=29.75$ | $85$ |

To calculate our chi-square statistic, we need to sum the squared differences between each observed and expected value divided by the expected value:

$$\begin{aligned}
\chi^2 &= \sum \frac{(observed - expected)^2}{expected} \\
\chi^2 &= \frac{(46 - 55.25)^2}{55.25} + \frac{(39 - 29.75)^2}{29.75} \\
\chi^2 &= \frac{(-9.25)^2}{55.25} + \frac{(9.25)^2}{29.75} \\
\chi^2 &= \frac{85.5625}{55.25} + \frac{85.5625}{29.75} \\
\chi^2 &= 1.5486 + 2.8760 \\
\chi^2 &= 4.4246
\end{aligned}$$

Now that we have our chi-square statistic, we need to compare it to the chi-square critical value for 0.05 with one degree of freedom, since we have two categories. Using the chi-square value calculator, we find the critical value to be 3.8414. The critical value indicates that, if the null hypothesis is true, only 0.05, or 5%, of samples would produce a $\chi^2$ value as high as 3.8414. If the $\chi^2$ of our data is greater than 3.8414, then fewer than 5 times out of 100 would we expect to get that result if the null hypothesis is true.

• Step 4: Interpret your results

Our calculated value of $\chi^2 = 4.4246$ is greater than the 0.05 significance level critical value of 3.8414, so we reject the null hypothesis. The data that Eric observed does not support the distribution that Rachel claimed.
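Example B can be checked with a short Python sketch. It relies on an assumption that applies only to one degree of freedom: a chi-square variable with $df = 1$ is the square of a standard normal variable, so the critical value is the square of the two-tailed Z critical value. The variable names are illustrative.

```python
from statistics import NormalDist

observed = [46, 39]      # male drivers, female drivers
claimed = [0.65, 0.35]   # Rachel's claimed proportions
total = sum(observed)    # 85 accidents

expected = [p * total for p in claimed]
statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

alpha = 0.05
# Assumption: valid only for df == 1 (two categories).
z_two_tailed = NormalDist().inv_cdf(1 - alpha / 2)
critical_value = z_two_tailed ** 2

reject_null = statistic > critical_value
print(round(statistic, 4), round(critical_value, 4), reject_null)
```

The computed critical value matches the table value to three decimal places; the last digit can differ depending on whether a table rounds or truncates.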

Example C

The online car magazine “Camaro5.com” claims that 51% of Ford Mustang or Chevy Camaro owners own Camaros. Ellen is a Mustang lover and decides to do some research. If Ellen collects the data below, does her data support the magazine’s claim?

Data: Mustang owners: 28, Camaro owners: 34

Solution:

• Step 1: Clearly state the null and alternative hypotheses

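$H_0$: The distribution of owners in Ellen's sample agrees with the magazine's claim (51% Camaro, 49% Mustang).

$H_1$: The distribution of owners in Ellen's sample does not agree with the magazine's claim.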

• Step 2: Identify an appropriate test and significance level

Since we are comparing two sets of data, and not just a single value, a Chi-Square test is appropriate. In the absence of a stated significance level in the problem, we assume the default 0.05.

• Step 3: Analyze sample data

We will start by creating a table to organize our data:

| | Mustang | Camaro | TOTAL |
|---|---|---|---|
| Observed | $28$ | $34$ | $62$ |
| Expected | $0.49 \times 62 \approx 30.4$ | $0.51 \times 62 \approx 31.6$ | $62$ |

Now we can calculate our chi-square statistic:

$$\begin{aligned}
\chi^2 &= \sum \frac{(observed - expected)^2}{expected} \\
\chi^2 &= \frac{(28 - 30.4)^2}{30.4} + \frac{(34 - 31.6)^2}{31.6} \\
\chi^2 &= \frac{(-2.4)^2}{30.4} + \frac{(2.4)^2}{31.6} \\
\chi^2 &= 0.3718
\end{aligned}$$

The chi-square critical value for  $df=1$ and a significance level of 0.05 is 3.8414 (the same as in Example B).

• Step 4: Interpret your results

Our calculated value of $\chi^2=0.3718$ is well below the 0.05 significance level critical value of 3.8414, so we fail to reject the null hypothesis. This means that, unfortunately for Ellen, her research did not allow her to deny the claim that Camaros are more popular.
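Instead of comparing against a critical value, the same decision can be made by computing the probability of a statistic at least this large when the null hypothesis is true. The sketch below is an illustration that goes slightly beyond the table lookup used in the examples. It assumes one degree of freedom, where that upper-tail probability equals `erfc(sqrt(x / 2))`; the function name `p_value_df1` is hypothetical.

```python
import math


def p_value_df1(statistic):
    """Upper-tail probability of a chi-square statistic with df = 1."""
    # Assumption: this identity holds only for one degree of freedom.
    return math.erfc(math.sqrt(statistic / 2))


alpha = 0.05
p_example_b = p_value_df1(4.4246)  # statistic from Example B
p_example_c = p_value_df1(0.3718)  # statistic from Example C

# Reject the null hypothesis when the probability falls below alpha.
print(p_example_b < alpha, p_example_c < alpha)
```

Under these assumptions the probability for Example B is below 0.05 and the probability for Example C is above it, which agrees with the decisions reached from the critical value.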

##### Concept Problem Revisited

Suppose you wanted to evaluate a recent statistic stating that iOS represents 32% and Android 51% of active smart phones. You would like to know if the statistic actually reflects the distribution of phones among your friends. How could you evaluate the data you collect to see if it supports this hypothesis?

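You could evaluate the hypothesis by collecting data from a simple random sample (SRS) of smart phone owners, counting how many use iOS, Android, or another platform (the remaining 17%), and using a chi-square goodness of fit test to see if your data supports the hypothesis.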

#### Vocabulary

A chi-square statistic is a derived value used in a chi-square test to evaluate how well a given distribution fits observed data.

The degrees of freedom of a variable are the number of values in the final calculation of a statistic that are free to vary. The degrees of freedom are calculated as $n-1$ , where  $n$ is the number of samples or categories in the variable.

#### Guided Practice

Questions 1-5 refer to the following data:

Tuscany claims that 70% of dog or cat owners own a dog, and 30% own a cat. Sayber decides to test her claim and learns that 23 of the 40 people he asks own dogs, and 17 own cats.

1. What kind of test could you use to see if Sayber’s data supports Tuscany’s claim?
2. What would be the null and alternative hypotheses?
3. What would be the expected values of dog and cat owners?
4. What is the chi-square statistic of the observed data?
5. Assuming a 0.1 significance level, does Sayber’s data support Tuscany’s claim?

Solutions :

1. A chi-square test would be appropriate.

2. The null hypothesis, $H_0$, would be that Sayber's data agrees with Tuscany's claim (70% dog owners, 30% cat owners). The alternative hypothesis, $H_1$, would be that it does not.

3. The expected number of dog owners, according to Tuscany’s claim, would be 70% of the 40 people that Sayber polled, or 28 dog owners. The expected number of cat owners would be 30% of the 40 people polled, or 12.

4. The  $\chi ^2$ statistic is the sum of the squared differences between the observed and expected values, divided by the expected values:

$$\begin{aligned}
\chi^2 &= \frac{(23 - 28)^2}{28} + \frac{(17 - 12)^2}{12} \\
&= \frac{25}{28} + \frac{25}{12} \\
\chi^2 &= 2.9762
\end{aligned}$$

5. The critical value of chi-squared for 1 degree of freedom at a significance level of 0.1 is 2.706. Since the chi-square statistic we calculated is 2.9762, and is therefore more extreme than the critical value, we may reject the null hypothesis, and say that Sayber’s data does not support Tuscany’s claim.
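The conclusion in this problem depends on the significance level chosen. The following Python sketch, which again assumes one degree of freedom so that the critical value is the square of the two-tailed Z critical value, compares the same statistic against two significance levels. The helper name `critical_value_df1` is hypothetical.

```python
from statistics import NormalDist

observed = [23, 17]  # dog owners, cat owners in Sayber's poll
expected = [28, 12]  # 70% and 30% of the 40 people polled

statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def critical_value_df1(alpha):
    """Chi-square critical value for df = 1 (assumption: df == 1 only)."""
    return NormalDist().inv_cdf(1 - alpha / 2) ** 2


# Map each significance level to the decision "reject the null hypothesis?".
decisions = {alpha: statistic > critical_value_df1(alpha) for alpha in (0.10, 0.05)}
print(round(statistic, 4), decisions)
```

At the 0.1 level the statistic exceeds the critical value of about 2.71, so the null hypothesis is rejected. At the stricter 0.05 level the critical value is about 3.84, and the same data would not lead to rejection.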

#### Practice

Questions 1-7 refer to the following:

Evan claims that 15% of computer gamers have played “Team Fortress 2”, and 35% have played “World of Warcraft”. Evan’s brother is skeptical of those figures and decides to do some research. He discovers that 60 of the 200 computer gamers he polls have played “Team Fortress 2”, and 90 have played “World of Warcraft”.

1. Create a table to organize the data and prepare for hypothesis testing.

2. What sort of test would be appropriate to determine if the observed data supports Evan’s claim?

3. What would be  $H_0$ and $H_1$ ?

4. What would be the  $\chi ^2$ statistic for the observed data?

5. How many degrees of freedom are there in the variable “played game”?

6. Assuming a significance level of 0.05, what is the  $\chi ^2$ critical value?

7. Does the observed data support Evan’s claim? Explain your findings.

Questions 8-14 refer to the following:

Mack claims that 84% of street racers drive import cars, and 16% drive domestic muscle cars. Abbi likes domestic cars and thinks Mack is overstating the percentage of imports, so she does some research of her own and finds that 57 of the street racers she interviewed drive imports, and 31 drive American muscle.

8. Create a table to organize the data and prepare for hypothesis testing.

9. What sort of test would be appropriate to determine if the observed data supports Mack’s claim?

10. What would be  $H_0$ and $H_1$ ?

11. What would be the  $\chi ^2$ statistic for the observed data?

12. How many degrees of freedom are there in the variable “type of car”?

13. Assuming a significance level of 0.10, what is the $\chi ^2$ critical value?

14. Does the data indicate that Abbi should reject, or fail to reject  $H_0$ ?

### Vocabulary

chi-squared distribution

The distribution of the chi-square statistic is called the chi-square distribution.

chi-squared goodness of fit test

The chi-square goodness of fit test can be used to estimate how closely an observed distribution matches an expected distribution.

chi-squared statistic

The chi-squared statistic ($\chi^2$) is used to evaluate how well a set of observed data fits a corresponding expected set.

chi-squared test

The chi-squared test calculates the probability that a given distribution is a good fit for observed data.

contingency tables

A contingency table (two-way table) is used to organize data from multiple categories of two variables so that various assessments may be made.

degrees of freedom

Degrees of freedom are essentially the number of samples that have the ‘freedom’ to change without necessarily affecting the sample mean. Degrees of freedom has the formula $df = n - 1$.

test for independence

The test for independence is used when estimating if two random variables are independent of one another.

test of significance

A test of significance (calculating a z-score or a t-statistic) is done when a claim is made about the value of a population parameter.